[00:58:33] PROBLEM - puppet last run on rdb2001 is CRITICAL puppet fail
[01:15:23] RECOVERY - puppet last run on rdb2001 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[01:27:06] I just noticed on a couple of WMF private wikis that PDFs don't generate? Is this a known issue or just something that was never possible?
[01:27:06] Rendering failed
[01:27:06] Generation of the document file has failed.
[01:27:06] Status: Bundling process died with non zero code: 1
[01:49:12] PROBLEM - puppet last run on mw2127 is CRITICAL puppet fail
[02:07:43] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:21:29] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 41s)
[02:21:40] Logged the message, Master
[02:26:47] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-31 02:25:44+00:00
[02:26:53] Logged the message, Master
[02:43:11] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 05m 51s)
[02:43:16] Logged the message, Master
[02:47:44] !log LocalisationUpdate completed (1.26wmf8) at 2015-05-31 02:46:41+00:00
[02:47:51] Logged the message, Master
[03:21:14] (PS1) Ladsgroup: Install Extension:Translate on labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313)
[03:27:21] operations, Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1323577 (Glaisher)
[03:34:52] PROBLEM - puppet last run on mw1188 is CRITICAL Puppet has 1 failures
[03:35:42] PROBLEM - puppet last run on mw1212 is CRITICAL Puppet has 1 failures
[03:39:04] (CR) Glaisher: "Sysops/bureaucrats should be able to add/remove users to/from translationadmin group." [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[03:48:54] PROBLEM - puppet last run on mw2050 is CRITICAL puppet fail
[03:50:47] (PS1) Glaisher: Enable "Other Projects Links" by default on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214894 (https://phabricator.wikimedia.org/T99901)
[03:51:10] (PS2) Glaisher: Enable "Other Projects Links" by default on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214894 (https://phabricator.wikimedia.org/T99901)
[03:51:43] RECOVERY - puppet last run on mw1188 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:52:42] RECOVERY - puppet last run on mw1212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:24] (PS2) Ladsgroup: Install Extension:Translate on labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313)
[04:07:24] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[04:36:43] PROBLEM - puppet last run on mw2204 is CRITICAL puppet fail
[04:55:22] RECOVERY - puppet last run on mw2204 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:59:33] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[05:35:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun May 31 05:34:36 UTC 2015 (duration 34m 35s)
[05:35:43] Logged the message, Master
[05:56:43] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:03:13] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0]
[06:20:23] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 2 failures
[06:32:13] PROBLEM - puppet last run on db2040 is CRITICAL Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures
[06:34:03] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[06:34:12] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:34:12] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures
[06:34:23] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures
[06:35:52] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 1 failures
[06:47:04] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on db2040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:43] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:43] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:44] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:52] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:47:52] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:47:52] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:53] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:53] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[07:47:32] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[08:23:03] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:15:52] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:30:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 604
[09:35:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 3876576 Threads: 1 Questions: 13455690 Slow queries: 25732 Opens: 61972 Flush tables: 2 Open tables: 64 Queries per second avg: 3.471 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:04:52] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323741 (AddisWang) Do you have mail server that we can use as xx@cn.wikimedia.org, or we can find outside service provider?
[10:33:11] (PS1) Yuvipanda: ores: Increase processes per CPU to 4 [puppet] - https://gerrit.wikimedia.org/r/214908
[10:33:42] (PS2) Yuvipanda: ores: Increase processes per CPU to 4 [puppet] - https://gerrit.wikimedia.org/r/214908
[10:34:14] (PS3) Yuvipanda: ores: Increase processes per CPU to 4 [puppet] - https://gerrit.wikimedia.org/r/214908
[10:34:20] (CR) Yuvipanda: [C: 2] ores: Increase processes per CPU to 4 [puppet] - https://gerrit.wikimedia.org/r/214908 (owner: Yuvipanda)
[10:34:28] (CR) Yuvipanda: [V: 2] ores: Increase processes per CPU to 4 [puppet] - https://gerrit.wikimedia.org/r/214908 (owner: Yuvipanda)
[11:15:22] (PS1) Yuvipanda: ores: Add experimental nginx proxy caching [puppet] - https://gerrit.wikimedia.org/r/214909
[11:16:51] (PS2) Yuvipanda: ores: Add experimental nginx proxy caching [puppet] - https://gerrit.wikimedia.org/r/214909
[11:19:23] (CR) Yuvipanda: [C: 2] ores: Add experimental nginx proxy caching [puppet] - https://gerrit.wikimedia.org/r/214909 (owner: Yuvipanda)
[11:30:24] (PS1) Yuvipanda: ores: Specify labs lvm requirement correctly [puppet] - https://gerrit.wikimedia.org/r/214910
[11:30:27] (PS1) Yuvipanda: ores: Enable caching even for resources with a cache header [puppet] - https://gerrit.wikimedia.org/r/214911
[11:30:30] (CR) jenkins-bot: [V: -1] ores: Specify labs lvm requirement correctly [puppet] - https://gerrit.wikimedia.org/r/214910 (owner: Yuvipanda)
[11:30:34] (CR) jenkins-bot: [V: -1] ores: Enable caching even for resources with a cache header [puppet] - https://gerrit.wikimedia.org/r/214911 (owner: Yuvipanda)
[11:30:36] (PS2) Yuvipanda: ores: Specify labs lvm requirement correctly [puppet] - https://gerrit.wikimedia.org/r/214910
[11:30:42] (PS2) Yuvipanda: ores: Enable caching even for resources with a cache header [puppet] - https://gerrit.wikimedia.org/r/214911
[11:31:36] (CR) Yuvipanda: [C: 2] ores: Specify labs lvm requirement correctly [puppet] - https://gerrit.wikimedia.org/r/214910 (owner: Yuvipanda)
[11:31:46] (CR) Yuvipanda: [C: 2] ores: Enable caching even for resources with a cache header [puppet] - https://gerrit.wikimedia.org/r/214911 (owner: Yuvipanda)
[11:38:04] (PS1) Yuvipanda: ores: Specify protocol explicitly for nginx backend [puppet] - https://gerrit.wikimedia.org/r/214912
[11:38:09] (CR) jenkins-bot: [V: -1] ores: Specify protocol explicitly for nginx backend [puppet] - https://gerrit.wikimedia.org/r/214912 (owner: Yuvipanda)
[11:39:39] (CR) Alex Monk: [C: -1] "Question on the task to be addressed first." [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[11:42:23] operations, Wikimedia-DNS, Wikimedia-Site-requests, Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323802 (Krenair) I think that's technically possible, but you should make a new ticket about it and CC me there so we can work out ho...
[11:45:18] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323812 (Krenair)
[11:51:11] operations, wikitech.wikimedia.org: distribution upgrade for wikitech-static instance - https://phabricator.wikimedia.org/T94585#1323817 (Krenair)
[11:51:12] operations, Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1323816 (Krenair)
[11:59:33] (CR) Alex Monk: Enable Echo on Wikimedia wikis by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/139326 (https://phabricator.wikimedia.org/T97760) (owner: Withoutaname)
[12:06:02] (CR) Ladsgroup: "Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[12:06:38] (CR) Alex Monk: Install Extension:Translate on labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[12:12:09] (CR) Alex Monk: Remove echowikis.dblist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/139581 (owner: Withoutaname)
[12:16:16] (CR) Alex Monk: "Let's just do what CentralAuth does: https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/maintenance/createLocalAcco" [mediawiki-config] - https://gerrit.wikimedia.org/r/139326 (https://phabricator.wikimedia.org/T97760) (owner: Withoutaname)
[12:26:30] (CR) Alex Monk: "Please see I3537206f" [mediawiki-config] - https://gerrit.wikimedia.org/r/139326 (https://phabricator.wikimedia.org/T97760) (owner: Withoutaname)
[12:26:42] (CR) Alex Monk: "I3537206f will fix this" [puppet] - https://gerrit.wikimedia.org/r/139581 (owner: Withoutaname)
[12:56:03] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[12:57:13] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 86%, RTA = 64.86 ms
[12:57:16] weird
[13:00:07] Krenair: what is 20** ? I know 10** is eqiad, 40** is codfw, and 30** is esams
[13:00:32] I think 2* was codfw?
[13:00:55] it's mw2027.codfw.wmnet
[13:01:24] 4* is ulsfo
[13:01:36] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Cluster_Servers
[13:05:16] matanya, ^
[13:05:33] ah, right. sundays...
[13:05:40] thanks.
[13:22:48] (CR) GWicke: "Yes, we haven't added all special wikis yet. Luckily it'll be straightforward to do so." [mediawiki-config] - https://gerrit.wikimedia.org/r/214833 (https://phabricator.wikimedia.org/T100026) (owner: Jforrester)
[13:39:25] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323912 (jeremyb-phone) >>! re T98676#1323802, @Krenair no, we never offer that to chapters. you need to find your own provider and use your own domain n...
[13:52:56] (PS1) Nemo bis: [English Planet] Add Bluerasberry, Nimish Gautam [puppet] - https://gerrit.wikimedia.org/r/214916
[14:07:40] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323941 (MZMcBride) "Never" is a bit of a strong word. For example, we have OTRS queues. And, of course, past practice shouldn't necessarily dictate future...
[14:17:38] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323947 (zhuyifei1999) @jeremyb, @Krenair, @AddisWang: Would a mailing on lists.wikimedia.org work? I'll create a ticket for it if it works.
[14:31:43] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323956 (jeremyb) > Would a mailing on lists.wikimedia.org work? I'll create a ticket for it if it works. to clarify xx@cn.wikimedia.org could mean many t...
[14:32:12] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323957 (Krenair) Is it really so hard to create a new ticket and stop bothering people on this one?
[14:39:19] operations, Wikimedia-DNS, Wikimedia-Site-requests: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1323963 (AddisWang) I'll make a new ticket, after having a discussion with other members.
[14:50:15] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1323975 (Joe) a:Joe
[14:50:28] (PS21) Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - https://gerrit.wikimedia.org/r/209741
[15:33:42] (CR) Andrew Bogott: [C: 1] "Looks fine. We could probably purge a lot more puppet_db stuff." [puppet] - https://gerrit.wikimedia.org/r/214637 (owner: Alexandros Kosiaris)
[15:35:48] (CR) Andrew Bogott: [C: -1] "I think we need to leave a couple more IPs in the allow range" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214638 (owner: Alexandros Kosiaris)
[15:36:09] (CR) Andrew Bogott: [C: 1] lint: fully qualify puppetmaster [puppet] - https://gerrit.wikimedia.org/r/214639 (owner: Alexandros Kosiaris)
[15:38:05] (CR) Andrew Bogott: [C: 1] "This is better! Small variable-naming request." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214640 (owner: Alexandros Kosiaris)
[15:39:47] (CR) Andrew Bogott: [C: -1] Rename role::puppet::server::labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214641 (owner: Alexandros Kosiaris)
[15:41:40] (CR) Andrew Bogott: [C: -1] "I don't mind the new name, although I'm not sure it's worth it for consistency. A fair amount of docs reference the old name, and lots of" [puppet] - https://gerrit.wikimedia.org/r/214642 (owner: Alexandros Kosiaris)
[16:03:01] (CR) Yuvipanda: "1. Shim the old name to just include the new one" [puppet] - https://gerrit.wikimedia.org/r/214642 (owner: Alexandros Kosiaris)
[16:03:19] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1324067 (Nemo_bis) >>! In T95184#1322229, @Dzahn wrote: > T95267 - removed as a blocker because the dump exists now which makes it possible to build one without...
[16:05:50] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1324075 (Nemo_bis) > You are free to call more people to express their opinions. I'm not the one who proposed this action and I don't intend to do the proposal...
[16:07:28] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1324076 (Nemo_bis)
[16:11:42] operations, Phabricator, Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1324079 (Nemo_bis) >>! In T85141#1315926, @MZMcBride wrote: > Anyone who wants this information should be given sufficient opportunity (a few months) to extract it from old-bugzilla...
[16:17:55] (PS1) Paladox: Add link in gitblit for phabricator [puppet] - https://gerrit.wikimedia.org/r/214923
[16:18:14] (PS2) Paladox: Add link in gitblit for phabricator [puppet] - https://gerrit.wikimedia.org/r/214923
[16:20:00] (CR) Paladox: "Please see https://gerrit.wikimedia.org/r/#/c/214923/ which breaks a bit of this patch." [puppet] - https://gerrit.wikimedia.org/r/209741 (owner: Paladox)
[16:22:52] (CR) Paladox: Add link in gitblit for phabricator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/214923 (owner: Paladox)
[16:45:33] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 1 failures
[17:02:22] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[17:12:00] !log performed a rolling restart of RESTBase Cassandra nodes to address elevated request error rates apparently related to schema disagreement
[17:12:05] Logged the message, Master
[17:20:06] !log Investigating RL issues (clients are loading mediawiki.notification&version=19700101T000000Z, mw.loader.moduleRegistry contains NaN for versions)
[17:20:10] Logged the message, Master
[17:26:19] (PS1) Odder: Make import group assignable on newiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214925 (https://phabricator.wikimedia.org/T100925)
[17:30:56] operations, Wikimedia-Mailing-lists: mailman emails taking long time for delivery, getting stuck in sodium - https://phabricator.wikimedia.org/T61731#1324190 (Nemo_bis) Well, dunno, this doesn't really look entirely healthy: https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=Emails%20pas...
[17:33:38] !log krinkle Synchronized php-1.26wmf7/resources: touch mediawiki.js (duration: 00m 13s)
[17:33:42] Logged the message, Master
[17:35:50] (PS2) Odder: Make import group assignable on newiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214925 (https://phabricator.wikimedia.org/T100925)
[17:35:58] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1324193 (JohnLewis)
[17:36:48] !log Confirmed RL problem solved. The jquery|mediawiki&version=bizqqnC request was cached with an old mw.loader implementation somehow. After the touch and sync, the version is now dQAzAsdU and the implementation is up to date.
[17:36:55] Logged the message, Master
[17:40:05] hey Krinkle
[17:40:22] Hi
[17:40:38] Krinkle: did that, by chance, have anything to do with https://phabricator.wikimedia.org/T100883 ?
[17:40:49] (or is that yet another issue?)
[17:41:12] Dont know
[17:41:14] Probably not related.
[17:41:23] Unless the culprit is somehow not having synced properly
[17:41:29] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1324197 (JohnLewis) The above tasking is getting annoying with the unless debates. As has been said, old-Bugzilla is not required for this as all data can be cl...
[18:37:03] !log krinkle Synchronized php-1.26wmf8/resources/src/mediawiki/mediawiki.js: rl live fix - I717b86573 (duration: 00m 12s)
[18:37:07] Logged the message, Master
[20:09:03] PROBLEM - puppet last run on mw2146 is CRITICAL puppet fail
[20:27:33] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[20:38:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[20:43:06] (PS1) Odder: Provide static PNG logos for emlwiki and kgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214981 (https://phabricator.wikimedia.org/T100953)
[21:01:28] operations, wikitech.wikimedia.org, Documentation: Wikitech: update Bacula article - https://phabricator.wikimedia.org/T100954#1324428 (Gage) NEW
[21:09:18] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1324445 (Joe)
[21:28:33] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:29:04] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:30:03] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:30:03] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:30:03] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0
[21:30:33] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient
[21:30:53] icinga-wm is so dramatic... :p
[21:31:19] one moment its critical then a swift recovery and everything is ok :)
[21:31:33] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning.
[21:31:33] RECOVERY - configured eth on graphite2001 is OK - interfaces up
[21:43:03] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:23] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:32] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPError - Could not complete SSL handshake.
[21:43:33] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:33] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:33] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:33] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:43] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:43:52] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:44:13] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:44:43] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds
[21:44:43] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:45:03] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures
[21:45:13] RECOVERY - Disk space on graphite2001 is OK: DISK OK
[21:45:13] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning.
[21:45:13] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:45:13] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0
[21:45:23] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning.
[21:45:43] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient
[21:46:13] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[21:46:13] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb
[21:46:13] RECOVERY - DPKG on graphite2001 is OK: All packages OK
[21:46:43] RECOVERY - configured eth on graphite2001 is OK - interfaces up
[21:46:43] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning.
[21:51:23] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds
[21:51:23] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:23] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:43] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:52] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:53] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:53] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:53] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:53] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:03] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:12] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:33] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:53:24] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (Nemo_bis) > The above tasking is getting annoying with the unless debates. Pro-tip: http://meatballwiki.org/wiki/DiminishingReplies
[22:02:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[22:03:12] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[22:03:12] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb
[22:03:13] RECOVERY - DPKG on graphite2001 is OK: All packages OK
[22:08:22] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds
[22:08:22] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:08:23] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:18:43] RECOVERY - Disk space on graphite2001 is OK: DISK OK
[22:18:43] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:18:43] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning.
[22:18:52] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0
[22:18:53] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning.
[22:19:13] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient
[22:19:54] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[22:20:02] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb
[22:20:02] RECOVERY - DPKG on graphite2001 is OK: All packages OK
[22:20:13] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 39 minutes ago with 0 failures
[22:20:23] RECOVERY - configured eth on graphite2001 is OK - interfaces up
[22:20:24] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning.
[22:22:57] operations, Phabricator, database: Add Story points (from Sprint Extension) to the phabricator data dump - https://phabricator.wikimedia.org/T100846#1324508 (mmodell)
[22:28:43] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:28:53] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:28:54] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:02] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:02] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:02] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:12] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:13] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:33] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:29:45] it fell off the net again while i was investigating :(
[22:30:13] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds
[22:30:13] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:30:13] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:31:03] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient
[22:35:51] !log graphite2001 keeps falling off the net due to OOM; swap 100% in use. dist-upgraded & rebooted. dmesg in ~gage/dmesg.2015-05-31
[22:35:55] Logged the message, Master
[22:36:12] PROBLEM - dhclient process on graphite2001 is CRITICAL: Timeout while attempting connection
[22:37:54] PROBLEM - Host graphite2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:38:23] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[22:38:23] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb
[22:38:24] RECOVERY - DPKG on graphite2001 is OK: All packages OK
[22:38:32] RECOVERY - Host graphite2001 is UP: PING OK - Packet loss = 0%, RTA = 44.02 ms
[22:38:44] RECOVERY - configured eth on graphite2001 is OK - interfaces up
[22:38:44] RECOVERY - Disk space on graphite2001 is OK: DISK OK
[22:38:44] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning.
[22:38:44] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:38:44] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning.
[22:39:03] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0
[22:39:12] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning.
[22:39:33] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient
[22:43:52] PROBLEM - Host graphite2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:45:22] RECOVERY - Host graphite2001 is UP: PING OK - Packet loss = 0%, RTA = 43.31 ms
[22:46:20] operations, ops-codfw: graphite2001 bios config issue - https://phabricator.wikimedia.org/T100959#1324524 (Gage) NEW
[22:48:00] ok i'm done with graphite2001 - had to reboot a second time for kernel update
[23:04:59] operations, wikitech.wikimedia.org, Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1324575 (Krenair)
[23:05:13] operations, Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1232100 (Krenair)
[23:05:39] operations, Documentation: Wikitech: update Bacula article - https://phabricator.wikimedia.org/T100954#1324580 (Krenair)
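
The PROBLEM/RECOVERY notifications in this log can be paired up to see how long each check stayed broken, and the hostname numbering discussed around 13:00 (1xxx eqiad, 2xxx codfw, 3xxx esams, 4xxx ulsfo) can label each host with its site. What follows is only a minimal sketch, not a tool used in the channel. The regular expression is an assumption modelled on the "<check> on <host> is ..." messages seen above, the names `ALERT_RE`, `SITES`, `site_of` and `pair_alerts` are hypothetical, and host-level "Host X is DOWN/UP" lines are deliberately ignored because they have a different shape.

```python
import re
from datetime import datetime

# Assumed message shape, taken from the service alerts in this log:
#   [HH:MM:SS] PROBLEM - <check> on <host> is CRITICAL ...
#   [HH:MM:SS] RECOVERY - <check> on <host> is OK ...
ALERT_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
)

# First digit of the four-digit host number -> site, as clarified in the
# 13:00 exchange (mw2027 is in codfw, 4xxx is ulsfo).
SITES = {"1": "eqiad", "2": "codfw", "3": "esams", "4": "ulsfo"}


def site_of(host):
    """Guess the site from a hostname such as 'mw2027' or 'ms-be1017'."""
    m = re.search(r"(\d)\d{3}$", host)
    return SITES.get(m.group(1), "unknown") if m else "unknown"


def pair_alerts(lines):
    """Return (host, site, check, duration) for each PROBLEM that recovered.

    Timestamps carry no date, so this assumes a single-day log and does
    not handle an alert that spans midnight.
    """
    open_alerts = {}  # (host, check) -> time of the first unresolved PROBLEM
    outages = []
    for line in lines:
        m = ALERT_RE.match(line)
        if not m:
            continue  # human chat, gerrit/phabricator bots, Host UP/DOWN
        when = datetime.strptime(m["time"], "%H:%M:%S")
        key = (m["host"], m["check"])
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the bot repeats itself.
            open_alerts.setdefault(key, when)
        elif key in open_alerts:
            started = open_alerts.pop(key)
            outages.append((m["host"], site_of(m["host"]), m["check"], when - started))
    return outages


sample = [
    "[00:58:33] PROBLEM - puppet last run on rdb2001 is CRITICAL puppet fail",
    "[01:15:23] RECOVERY - puppet last run on rdb2001 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures",
    "[12:57:16] weird",
]
for host, site, check, duration in pair_alerts(sample):
    print(host, site, check, duration)
```

Fed the whole log, the same function would list each graphite2001 flap separately, which makes the repeated drop-outs before the 22:35 reboot easier to see than in the raw notification stream.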