[00:00:09] (03PS1) 10: ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) [00:06:40] ^ that's me, gerrit-wm can't understand my username! https://phabricator.wikimedia.org/T136721 [00:08:39] heh, i have not noticed this before [00:09:53] Amir1: i just noticed this.. when you go to Gerrit and type in the search field, owner:ladsgroup see what happens [00:09:57] it autocompletes to [00:10:01] owner:"Anonymous Coward " [00:10:17] and then there is a second user with the same email address but its a bot [00:10:40] the second is the bot I wrote a very long time ago [00:10:55] the first one looks like a not-very-funny joke [00:10:57] Amir1: in Gerrit, click your own user, then settings.. [00:11:11] then profile on the left [00:11:20] is "Full Name" empty or something? [00:11:36] empty [00:11:55] but you cant edit it , right [00:12:03] no [00:12:06] i guess this happened back when the wikitech user was created ..hmm [00:12:14] that creates the LDAP user [00:12:17] that is also used here [00:12:32] looks in LDAP [00:13:44] mutante: is the "Anonymous Coward" the default? [00:14:04] hmm, both "sn" and "cn" fields are set to Ladsgroup [00:14:30] Amir1: yea, default _if_ the fullname is missing [00:14:37] somehow [00:15:11] the thing is my full name has been empty for a very long time (since I remember) and I tried to change it but I couldn't [00:15:16] Amir1: ah, try this, in the Gerrit settings, go to Contact Information [00:15:25] there is another Full Name field [00:15:44] yeah [00:15:58] and then there is "Identities" [00:16:05] I clicked the reload [00:16:13] then "Ladsgroup" got added [00:16:25] ah! [00:16:32] lets see the bot again now [00:16:59] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) (owner: 10Ladsgroup) [00:17:03] :) [00:17:05] yup [00:17:09] thanks [00:17:14] yw [00:17:19] mutante: one thing. 
Is there a way to change the full name? [00:19:32] Amir1: i .. want that too https://phabricator.wikimedia.org/T113792 i hope yes [00:19:39] but its not easy [00:20:22] https://wikitech.wikimedia.org/wiki/Renaming_users and more afaict [00:21:19] mutante: Just to be clear, I don't want to change my username, [00:21:19] I want to change only my full name [00:21:56] yea, same here [00:22:03] i also just want my actual full name [00:22:29] oh [00:22:54] so it's not worth it, maybe we can wait until gerrit is moved to differential [00:23:23] i was wondering too if its better or worse after the move [00:23:23] (it is not worth it for me, for you, maybe it's different) [00:24:02] it bugs me every day but i cant rewrite git history anyways [00:24:08] so im just not sure anymore [00:24:21] I have some patches there, it looks nice but I think it might be a hassle to get used to it [00:24:59] it might be possible to change the name [00:25:01] but not the history [00:25:10] so then you have your contributions split up into 2 names [00:25:15] that is kind of worse [00:26:05] hmm, you're right. 
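(Editor's note: git's own mechanism for the "contributions split up into 2 names" problem discussed above is a `.mailmap` file in the repository root, which maps historical commit identities onto a canonical one. A minimal sketch, with placeholder names and addresses — the actual accounts are not in the log:)

```
# .mailmap — map an old commit identity onto the canonical one
Canonical Name <canonical@example.org> Old Name <old@example.org>
```

With this in place, tools like `git shortlog` and `git blame` display the canonical identity without rewriting history.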
I never thought of git history [00:30:41] Amir1: i did change it at one point in .git/config, name = Full Name, also i have that .mailmap file in the repo [00:31:08] the result is it is sometimes like this and sometimes like that, heh [00:31:28] haha [00:31:34] I can imagine [00:35:23] (03PS1) 10Dzahn: aptrepo: make owners of incoming dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/292770 (https://phabricator.wikimedia.org/T132757) [00:36:00] (03PS2) 10Dzahn: aptrepo: make owners of incoming dir configurable [puppet] - 10https://gerrit.wikimedia.org/r/292770 (https://phabricator.wikimedia.org/T132757) [00:39:07] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3051/" [puppet] - 10https://gerrit.wikimedia.org/r/292770 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:42:51] https://www.flickr.com/photos/girliemac/sets/72157628409467125/with/6513001321/ [00:42:58] HTTP status codes ^ [00:46:11] Amir1: :) nice! needs to be on https://phabricator.wikimedia.org/T113114 [00:46:38] :D [00:47:44] nice bug page [00:58:31] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#2354969 (10Dzahn) [00:59:00] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#2354985 (10Dzahn) public IP has been removed, mgmt IP is still there [00:59:24] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2354987 (10Dzahn) [00:59:26] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#1936373 (10Dzahn) 05Open>03Resolved [00:59:53] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#2354991 (10Dzahn) [00:59:55] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - 
https://phabricator.wikimedia.org/T123713#1936373 (10Dzahn) [01:00:31] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#1936373 (10Dzahn) [01:00:33] 06Operations, 10ops-eqiad: decom magnesium (data center) - https://phabricator.wikimedia.org/T137006#2354969 (10Dzahn) [01:02:37] 06Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2354994 (10Dzahn) [01:04:22] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2354995 (10Dzahn) [01:05:22] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2354996 (10Dzahn) [01:05:56] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10Dzahn) magnesium shut down, so that's off the list [01:06:08] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: puppet fail [01:21:33] (03PS1) 10Dzahn: add moon.wikimedia.org, point to cluster [dns] - 10https://gerrit.wikimedia.org/r/292771 (https://phabricator.wikimedia.org/T136557) [01:21:47] (03PS2) 10Dzahn: add moon.wikimedia.org, point to cluster [dns] - 10https://gerrit.wikimedia.org/r/292771 (https://phabricator.wikimedia.org/T136557) [01:25:26] (03PS1) 10Dzahn: redirect moon.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/292772 (https://phabricator.wikimedia.org/T136557) [01:30:17] (03PS2) 10Ladsgroup: ores: deprecate flower [puppet] - 10https://gerrit.wikimedia.org/r/292769 (https://phabricator.wikimedia.org/T137003) [01:32:27] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2355010 (10Dzahn) @MBrent done so far on our side. 
next step would be content from Mule and letting them know to upload just like they did for annualreport and 15.wp, the only difference is repo name... [01:35:07] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:44:57] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 666 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5647998 keys - replication_delay is 666 [01:48:03] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2355023 (10faidon) [01:48:05] 06Operations, 10ops-ulsfo: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2355021 (10faidon) 05Open>03Resolved Done :) [01:52:11] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2355026 (10faidon) [02:04:25] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5624449 keys - replication_delay is 0 [02:23:25] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2096 MB (3% inode=96%) [02:25:11] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 09m 08s) [02:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:50] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jun 4 02:30:50 UTC 2016 (duration 5m 39s) [02:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:03] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.484 second response time [03:05:55] 06Operations, 06Mobile-Apps, 10Traffic, 06Wikipedia-Android-App-Backlog: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2355039 (10Mholloway) I looked over our past usage of bits.wikimedia.org, and found that we used it for (1) 
automatically downlo... [04:30:28] any ops around? I need opsy to change incorrect ownership of a table in postgres (for the non-production maps server that we are setting up) [04:30:39] very simple thing [04:33:11] hmm, jynus is on duty but not in the channel :( [04:38:14] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Puppet has 1 failures [05:03:46] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [05:21:07] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [05:46:18] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:33:34] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:35] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:21:42] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: puppet fail [07:34:29] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. 
[07:36:28] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [07:48:49] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:19] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail [08:58:18] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:07:21] (03PS10) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [09:11:52] (03CR) 10Elukey: "Implemented Andrew's suggestion to reduce the amount of runtime string cmp and tested again :)" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [09:28:29] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: puppet fail [09:38:07] !log Lowering down temporarily the Analytics kafka upload retention time to 24h to free space (T136690) [09:38:10] T136690: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690 [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:29] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:59] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:00] PROBLEM - Check size of conntrack table on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:40:09] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:40:20] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:30] PROBLEM - configured eth on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:40] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:40] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:41] PROBLEM - dhclient process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - salt-minion processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:51] PROBLEM - Disk space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:29] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:39] anybody on --^ ? [09:41:40] PROBLEM - nutcracker process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:43:51] root login hanging, powercycling [09:44:09] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [09:44:20] RECOVERY - dhclient process on mw1144 is OK: PROCS OK: 0 processes with command name dhclient [09:44:21] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [09:44:29] RECOVERY - DPKG on mw1144 is OK: All packages OK [09:44:39] RECOVERY - Disk space on mw1144 is OK: DISK OK [09:44:39] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:45:10] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures [09:45:20] RECOVERY - nutcracker process on mw1144 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:45:31] RECOVERY - Check size of conntrack table on mw1144 is OK: OK: nf_conntrack is 0 % full [09:45:49] didn't do anything [09:45:49] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [09:45:50] [Sat Jun 4 09:49:46 2016] Out of memory: Kill process 7947 (hhvm) score 919 or sacrifice child [09:46:09] RECOVERY - configured eth on mw1144 is OK: OK - interfaces up [09:47:36] !log removed temporary Analytics Kafka upload retention override [09:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:49] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65974 bytes in 3.375 second response time [09:51:19] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.075 second response time [09:51:27] !log restarted hhvm on mw1144 after the host was hanging (OOM killer restored basic host functionalities but not hhvm) [09:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:39] RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is 
currently enabled, last run 46 seconds ago with 0 failures [10:51:29] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [11:06:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:09:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:10:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:12:29] high spike, but seems already gone [11:14:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:16:58] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:17:47] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:37:09] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:37:37] PROBLEM - Restbase root url on restbase1014 is CRITICAL: Connection refused [11:47:39] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:48] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [11:49:08] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.007 second response time [12:15:18] PROBLEM - Start and 
verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.973 second response time [12:19:09] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 14.832 second response time [12:49:38] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [12:59:56] akosiaris or gehel, could you kill a process on maps2001 pls? [13:00:36] any sudo ops around? [13:03:26] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.799 second response time [13:04:56] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.701 second response time [13:13:42] yurik: sorry, going swimming with Oscar. Can do it in a few hours when I get back... [13:14:08] gehel, sure, could you look at that issue i created about missing indexes? [13:14:29] gehel, basically stop the updater and run the commands [13:14:31] thx! [13:14:47] no rush, enjoy the swim! [13:14:50] Will do... [13:16:57] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:46] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [14:05:51] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2355373 (10Papaul) Tracking number for the memory that were returned to Dell on 5/27/2016 {F4119731} {F4119733} [14:08:17] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2355374 (10Papaul) @jcrespo or @ Volan I have the firmware file please let me know when is the best time next week to schedule a downtime for those systems. Thanks. 
[14:16:57] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:25:01] (03PS1) 10BBlack: redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) [14:26:21] (03CR) 10jenkins-bot: [V: 04-1] redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [15:08:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5667969 keys - replication_delay is 623 [15:23:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5608704 keys - replication_delay is 0 [15:53:49] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: puppet fail [16:21:25] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:07] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [17:50:57] PROBLEM - Disk space on labmon1001 is CRITICAL: DISK CRITICAL - free space: / 3469 MB (3% inode=98%) [18:13:17] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.816 second response time [18:15:17] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.858 second response time [18:15:36] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:20:06] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [18:47:46] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [19:20:28] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [19:47:37] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:54:57] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [19:58:36] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:06] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:37] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:37] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:47] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:57] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:01:37] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:02:07] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:02:16] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:07:16] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:07:36] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:06] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:07] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:18:35] !log rebooting mw1135, unresponsive to ssh or console login [20:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:21:26] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [20:21:28] RECOVERY - Disk space on mw1135 is OK: DISK OK [20:21:28] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [20:21:37] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:21:56] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.911 second response time [20:21:57] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [20:21:57] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 54 minutes ago with 0 failures [20:22:07] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [20:22:17] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full [20:22:36] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:22:46] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:56] RECOVERY - DPKG on mw1135 is OK: All packages OK [20:22:57] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [20:23:17] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:17] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 66308 bytes in 0.092 second response time [20:24:17] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:47] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:24:57] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:27] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:27] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:48] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:59] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:28:56] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:28:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:29:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:30:16] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:31:47] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:31:47] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:31:56] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:34:46] RECOVERY - Disk space on mw1138 is OK: DISK OK [20:35:28] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:36:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:36:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:38:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:17] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:41:17] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [20:41:27] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:41:27] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [20:41:36] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212 [20:41:47] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [20:41:57] RECOVERY - DPKG on mw1138 is OK: All packages OK [20:42:37] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 0 % full [20:43:27] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [20:49:27] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail [20:51:27] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 63 failures [20:54:17] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:46] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:46] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:55:06] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:06] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:07] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:07] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:26] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:36] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:06] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:46] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [20:56:56] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:56:56] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [21:02:47] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:07] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:07] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:04:07] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 0 % full [21:04:37] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:04:37] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [21:04:48] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:04:48] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [21:04:57] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212 [21:04:57] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [21:05:16] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [21:05:26] RECOVERY - DPKG on mw1138 is OK: All packages OK [21:05:46] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.070 second response time [21:05:56] RECOVERY - Disk space on mw1138 is OK: DISK OK [21:06:26] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 66310 bytes in 1.020 second response time [21:07:26] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:16:48] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:20:27] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:57] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:16] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:27] PROBLEM - Disk space on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:46] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:21:57] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:22:06] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:17] PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:26] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:06] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:27] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:26:36] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:36] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm [21:29:46] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [21:29:47] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient [21:34:05] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:34:35] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:45] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:36:51] mw113[158] seem(ed) to have a hard time. 
[21:39:55] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up [21:40:06] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [21:40:15] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient [21:40:34] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212 [21:40:36] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.576 second response time [21:40:36] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:40:54] RECOVERY - Disk space on mw1131 is OK: DISK OK [21:40:55] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full [21:41:06] RECOVERY - DPKG on mw1131 is OK: All packages OK [21:41:15] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 66312 bytes in 3.806 second response time [21:41:25] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:41:35] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm [21:46:55] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures