[00:18:12] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:31:55] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail
[00:56:42] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:10:20] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on port 9042
[01:21:53] (PS6) Ori.livneh: gmond_memcached.py: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291176 (owner: BryanDavis)
[01:22:03] (CR) Ori.livneh: [C: 2 V: 2] gmond_memcached.py: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291176 (owner: BryanDavis)
[01:26:40] urandom: still need someone?you
[01:26:43] *you?
[01:47:53] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: puppet fail
[01:53:45] YuviPanda: sure!
[01:54:18] (PS2) Yuvipanda: enable instance restbase2004-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/291563 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[01:54:28] (CR) Yuvipanda: [C: 2 V: 2] enable instance restbase2004-b.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/291563 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[01:54:53] urandom: done
[01:54:57] YuviPanda: thanks!
[01:59:10] (PS1) Eevans: enable instance restbase1008-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/291579 (https://phabricator.wikimedia.org/T134016)
[02:07:43] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: Connection refused
[02:08:34] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-05-31 02:08:15.
[02:13:26] YuviPanda: you still around?
[02:13:37] urandom: yeah
[02:13:42] i did a stupid
[02:14:03] fun
[02:14:16] i did an iptables -F on restbase2004.codfw.wmnet, milliseconds before i realized that the default policy is DROP
[02:15:05] i guess the next puppet run will probably fix it, worst-case, but do you have a way of bouncing it, or getting in out-of-bound?
[02:15:52] PROBLEM - Host restbase2004 is DOWN: PING CRITICAL - Packet loss = 100%
[02:16:00] yeeeaa
[02:16:02] urandom: ah, yeah, it's ok
[02:16:14] urandom: if puppet doesn't reset, we can try salt or oob
[02:16:19] is mgmt interface and bounce it
[02:16:26] urandom: is it in production?
[02:16:58] yeah, but it won't create an outage
[02:17:03] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:17:08] how often does puppet run, twice an hour?
[02:17:20] urandom: ok, 20mins. we'll wait 20mins and if not, bounce it
[02:17:24] is that ok?
[02:17:27] yeah
[02:22:00] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 08m 35s)
[02:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 29 02:27:27 UTC 2016 (duration 5m 27s)
[02:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:54] urandom: it isn't back up is it
[02:39:48] YuviPanda: nope
[02:40:01] puppet runs every 30 mins or so?
[02:41:38] urandom: yeah, I'm going in oob now
[02:42:25] YuviPanda: i saved the rules before flushing, if that helps
[02:42:45] you could iptables-restore < ~eevans/iptables.txt
[02:43:34] urandom: yeah, that'll help, I'll do it
[02:48:59] aha
[02:49:04] urandom: try now?
[02:49:05] it lives
[02:49:09] RECOVERY - Host restbase2004 is UP: PING OK - Packet loss = 0%, RTA = 47.99 ms
[02:49:15] icinga concurs
[02:50:16] YuviPanda: thank you!
[02:50:26] urandom: yw!
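The lockout discussed above follows from how `iptables -F` behaves: it deletes the rules in each chain but leaves the chain's default policy untouched, so with a policy of DROP every packet is then dropped. Below is a minimal sketch of a pre-flight check, assuming the input is ruleset text in the form printed by `iptables -S` (`-P INPUT DROP`) or `iptables-save` (`:INPUT DROP [0:0]`). The function names and the sample ruleset are hypothetical, for illustration only, and are not taken from any host in this log.

```python
def chains_with_drop_policy(ruleset_text):
    """Return the chains whose default policy is DROP.

    Accepts text in either the `iptables -S` form ("-P INPUT DROP") or the
    `iptables-save` form (":INPUT DROP [0:0]"). Other lines are ignored.
    """
    dropped = []
    for raw_line in ruleset_text.splitlines():
        fields = raw_line.split()
        if len(fields) >= 3 and fields[0] == "-P":
            chain, policy = fields[1], fields[2]
        elif len(fields) >= 2 and fields[0].startswith(":"):
            chain, policy = fields[0][1:], fields[1]
        else:
            continue
        if policy == "DROP":
            dropped.append(chain)
    return dropped


def safe_to_flush(ruleset_text):
    """A flush is treated as safe only if no chain has a DROP policy."""
    return not chains_with_drop_policy(ruleset_text)


# Hypothetical ruleset, for illustration only.
SAMPLE = """\
-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT
"""

print(chains_with_drop_policy(SAMPLE))
print(safe_to_flush(SAMPLE))
```

A check like this only reports the risk. As in the exchange above, saving the rules before flushing is what made recovery a single `iptables-restore` once out-of-band access was available.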
[02:51:18] i actually think i realized i shouldn't do that *before* enter was pressed
[02:51:35] but too late? :D
[02:51:37] by just a fraction of a second before, so the signal was already sent to my finger :)
[02:54:11] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.125, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[02:54:29] PROBLEM - Restbase root url on restbase2004 is CRITICAL: Connection refused
[02:54:49] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: puppet fail
[02:56:10] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[02:56:20] RECOVERY - Restbase root url on restbase2004 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.117 second response time
[02:58:49] !log Bootsrapping restbase2004-b.codfw.wmet : T134016
[02:58:50] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016
[02:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:06] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:12:49] urandom: I'm going afk now, jfyi. You can still call me for emergencies and what not :)
[03:18:40] YuviPanda: no, i'm good; thanks for all your help!
[03:38:39] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:44:18] (PS5) Ori.livneh: librenms: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291174 (owner: BryanDavis)
[03:44:26] (CR) Ori.livneh: [C: 2 V: 2] librenms: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291174 (owner: BryanDavis)
[04:00:49] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:03:29] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:25:29] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:37:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 676 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6270014 keys - replication_delay is 676
[04:50:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6231170 keys - replication_delay is 0
[05:12:00] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:36:59] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:30:39] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail
[06:31:18] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail
[06:31:47] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:56] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:56] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:45] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:55:57] (PS2) Yuvipanda: k8s: Expand types of resources accessible to normal users [puppet] - https://gerrit.wikimedia.org/r/291490
[06:56:37] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:31] (PS12) Yuvipanda: tools: Fixup k8s bastion role [puppet] - https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413)
[06:57:40] (CR) Yuvipanda: [C: 2 V: 2] tools: Fixup k8s bastion role [puppet] - https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) (owner: Yuvipanda)
[06:57:46] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:58:25] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:55] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:12] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail
[08:37:51] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:27:56] (CR) Alexandros Kosiaris: [C: 2] rsync::module: Replace obsolete to_a calls [puppet] - https://gerrit.wikimedia.org/r/291491 (owner: Alexandros Kosiaris)
[09:28:00] (PS2) Alexandros Kosiaris: rsync::module: Replace obsolete to_a calls [puppet] - https://gerrit.wikimedia.org/r/291491
[09:28:11] (CR) Alexandros Kosiaris: [V: 2] rsync::module: Replace obsolete to_a calls [puppet] - https://gerrit.wikimedia.org/r/291491 (owner: Alexandros Kosiaris)
[09:31:54] (PS2) Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - https://gerrit.wikimedia.org/r/291263
[09:31:56] (PS24) Alexandros Kosiaris: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[09:31:58] (PS4) Alexandros Kosiaris: network: Move into module [puppet] - https://gerrit.wikimedia.org/r/291234
[09:32:00] (PS9) Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219
[09:33:53] (CR) Alexandros Kosiaris: "A couple of more changes after that, I think I have something final and ready for review. PCC is running again for multiple hosts as job #" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[09:43:43] PROBLEM - Check if rsync server is running on labsdb1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name rsync, regex args /usr/bin/rsync --no-detach --daemon
[09:50:30] (CR) jenkins-bot: [V: -1] network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[09:57:37] (CR) jenkins-bot: [V: -1] network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219 (owner: Alexandros Kosiaris)
[10:10:41] RECOVERY - Check if rsync server is running on labsdb1006 is OK: PROCS OK: 1 process with command name rsync, regex args /usr/bin/rsync --no-detach --daemon
[10:19:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:45:59] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[11:20:09] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:47:33] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:24:08] (PS1) KartikMistry: Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517)
[15:07:12] Operations, Traffic, Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2337520 (BBlack) At the new values testing on cp3048 we seem to stabilize around ~96G virt, with the usual RSS variation in the [78]xGB range. That's only 22% overhead,...
[15:15:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 684 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6286589 keys - replication_delay is 684
[15:22:32] (CR) Lydia Pintscher: [C: -1] "Yay :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: KartikMistry)
[15:29:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6236976 keys - replication_delay is 0
[15:34:19] (PS1) BBlack: varnish: jemalloc tuning for frontend caches [puppet] - https://gerrit.wikimedia.org/r/291592 (https://phabricator.wikimedia.org/T135384)
[15:34:21] (PS1) BBlack: raise fe mem size to 37% on text and upload [puppet] - https://gerrit.wikimedia.org/r/291593 (https://phabricator.wikimedia.org/T135384)
[15:54:39] (CR) jenkins-bot: [V: -1] raise fe mem size to 37% on text and upload [puppet] - https://gerrit.wikimedia.org/r/291593 (https://phabricator.wikimedia.org/T135384) (owner: BBlack)
[16:44:03] (CR) BBlack: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291593 (https://phabricator.wikimedia.org/T135384) (owner: BBlack)
[17:52:58] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:53:08] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:56:38] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[17:56:48] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:01:18] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[18:03:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6235188 keys - replication_delay is 0
[18:18:36] Operations, Performance-Team, Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2337661 (Gilles) At my request, the original author set a license for his code and picked MIT. I passed that information to the maintainer of pyssim, who did the same...
[18:42:19] Operations, Performance-Team, Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2337674 (Gilles) PR merged: https://github.com/thumbor/thumbor/commit/be582f5620281efbee635e043537f0a7bd1a696c
[18:42:55] Operations, Performance-Team, Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2337676 (Gilles)
[18:44:38] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[18:49:24] (PS1) Dereckson: Set site title for ma.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/291601 (https://phabricator.wikimedia.org/T136514)
[19:23:47] Is there anyone around with +2 on puppet that would mind merging https://gerrit.wikimedia.org/r/#/c/291579/ for me (another Cassandra bootstrap)?
[19:24:06] (PS2) Eevans: enable instance restbase1008-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/291579 (https://phabricator.wikimedia.org/T134016)
[19:44:44] <_joe_> urandom: I'm here now
[19:44:52] <_joe_> still needed?
[19:48:12] _joe_: sure!
[20:12:18] (PS1) Yuvipanda: k8s: Ensure /etc/kubernetes is present wherever required [puppet] - https://gerrit.wikimedia.org/r/291617
[20:12:44] (PS3) Yuvipanda: k8s: Expand types of resources accessible to normal users [puppet] - https://gerrit.wikimedia.org/r/291490
[20:13:04] (PS2) Yuvipanda: k8s: Ensure /etc/kubernetes is present wherever required [puppet] - https://gerrit.wikimedia.org/r/291617
[20:35:41] (CR) jenkins-bot: [V: -1] k8s: Ensure /etc/kubernetes is present wherever required [puppet] - https://gerrit.wikimedia.org/r/291617 (owner: Yuvipanda)
[20:43:13] _joe_: can you please comment on https://phabricator.wikimedia.org/T134309#2315876 when you aren't busy / don't have to deal with more important stuff?
[20:43:38] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2337820 (RobH) @Ladsgroup: Are you aware of a user group that will allow you to deploy just ores related changes to the scb cluster? As fa...
[20:46:04] (PS3) Paladox: [Timeline] Update path to extension [mediawiki-config] - https://gerrit.wikimedia.org/r/244739
[20:53:47] (PS1) Paladox: [Timeline] Update path to extension [mediawiki-config] - https://gerrit.wikimedia.org/r/291673
[21:19:03] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:52] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.199 second response time
[21:33:08] (CR) Yuvipanda: [C: 2] k8s: Expand types of resources accessible to normal users [puppet] - https://gerrit.wikimedia.org/r/291490 (owner: Yuvipanda)
[21:34:16] (CR) Yuvipanda: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291617 (owner: Yuvipanda)
[21:40:00] Is there anyone around with +2 on puppet that would mind merging https://gerrit.wikimedia.org/r/#/c/291579/ (it's another Cassandra bootstrap)?
[21:40:40] (PS3) Yuvipanda: enable instance restbase1008-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/291579 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[21:40:45] urandom: merging, but going afk in a few mins
[21:40:52] YuviPanda: \o/
[21:40:54] (CR) Yuvipanda: [C: 2 V: 2] enable instance restbase1008-c.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/291579 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[21:40:56] thanks!
[21:41:05] urandom: odne
[21:42:20] (PS3) Yuvipanda: k8s: Ensure /etc/kubernetes is present wherever required [puppet] - https://gerrit.wikimedia.org/r/291617
[22:09:02] PROBLEM - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: Connection refused
[22:09:20] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[22:11:52] ACKNOWLEDGEMENT - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Bootstrapping - The acknowledgement expires at: 2016-05-30 22:11:37.
[22:12:32] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: Connection refused eevans Boostrapping - The acknowledgement expires at: 2016-05-30 22:12:14.
[22:13:11] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[22:18:51] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[22:27:02] ACKNOWLEDGEMENT - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Bootstrapping - The acknowledgement expires at: 2016-05-30 22:26:47.
[22:28:21] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[22:34:20] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[22:59:02] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[23:04:59] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[23:29:30] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[23:34:54] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[23:35:34] ACKNOWLEDGEMENT - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Being investigated. - The acknowledgement expires at: 2016-06-01 23:35:08.
[23:54:33] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042