[00:03:07] <wikibugs>	 operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1158947 (Dzahn) confirmed send_nsca is installed on boron and can send packets over to neon:  @boron:/etc/cron.d# /usr/sbin/send_nsca -H neon.wikimedia.org  <-->  @neon:~# tcpdump port 5667 | grep boron  also:...
[00:07:29] <AaronS>	 bd808: meow?
[00:08:54] * AaronS had his box hanging in auto-shutdown since "vagrant up" was still running
[00:10:07] <bd808>	 hey. Krenair was dangling some memcached failures in front of me that I thought were a new/increasing problem but that may have been a false alarm.
[00:10:29] <Krenair>	 sorry
[00:10:33] <Krenair>	 it's not coming up very often
[00:10:34] * bd808 was multitasking and trying to pass the problem on nerd snipe style
[00:14:30] <AaronS>	 gtg
[00:14:49] <wikibugs>	 operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1158953 (Dzahn) confirmed it's trusty's version of send_nsca. i could use the one from precise and it worked:  echo -e "boron\tcheck_disk\t0\ttest" | /tmp/send_nsca -H neon.wikimedia.org -c /etc/send_nsca.cfg  [...
[00:18:27] <wikibugs>	 operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1158954 (Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=boron  works again. re-enabled notifications. but it's a hack for now. so don't close yet.
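For reference, the test in T89983 pipes a single tab-separated line (host, service, return code, plugin output) into send_nsca. A minimal Python sketch of building that line follows; the function name format_passive_check is hypothetical and not part of any Wikimedia tooling, and the return codes follow the usual Nagios convention.

```python
def format_passive_check(host: str, service: str, return_code: int, output: str) -> str:
    """Build one send_nsca input line: host, service, return code, output, tab-separated."""
    # Nagios-style return codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    # A tab or newline inside a field would break the line format.
    for field in (host, service, output):
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{host}\t{service}\t{return_code}\t{output}\n"


# Same payload as the echo -e test quoted in the ticket.
line = format_passive_check("boron", "check_disk", 0, "test")
print(repr(line))
```

The resulting string is what would be written to the standard input of send_nsca; which binary and config file to use is left out here, since that was exactly the open question in the ticket.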
[00:20:34] <hoo>	 !log Attached local accounts to "Advance", per request: enwiki, commonswiki, metawiki, nlwiktionary and nlwikinews
[00:20:41] <morebots>	 Logged the message, Master
[00:22:29] <wikibugs>	 operations: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1158963 (Dzahn) a:Dzahn>None could somebody help me here and take a look?
[00:22:49] <wikibugs>	 operations, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1158971 (Dzahn) p:Triage>Normal
[00:46:22] <grrrit-wm>	 (PS1) Gage: ipsec-global: add /bin to path [puppet] - https://gerrit.wikimedia.org/r/200276
[00:50:43] <jgage>	 tricksy conditional :-
[00:54:13] <grrrit-wm>	 (PS5) BryanDavis: proxies: allow filtering by datacenter [tools/scap] - https://gerrit.wikimedia.org/r/200130 (owner: Giuseppe Lavagetto)
[00:55:56] <grrrit-wm>	 (CR) BryanDavis: [C: 1] "More refactoring. Tested on my dev server and it still works without enabling the filter. Not sure where we can test the filter outside of " [tools/scap] - https://gerrit.wikimedia.org/r/200130 (owner: Giuseppe Lavagetto)
[02:11:35] <grrrit-wm>	 (CR) 20after4: [C: 1] proxies: allow filtering by datacenter [tools/scap] - https://gerrit.wikimedia.org/r/200130 (owner: Giuseppe Lavagetto)
[02:23:02] <logmsgbot>	 !log l10nupdate Synchronized php-1.25wmf22/cache/l10n: (no message) (duration: 07m 03s)
[02:23:19] <morebots>	 Logged the message, Master
[02:27:36] <logmsgbot>	 !log LocalisationUpdate completed (1.25wmf22) at 2015-03-28 02:26:33+00:00
[02:27:42] <morebots>	 Logged the message, Master
[02:48:06] <logmsgbot>	 !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 06m 50s)
[02:48:16] <morebots>	 Logged the message, Master
[02:52:48] <logmsgbot>	 !log LocalisationUpdate completed (1.25wmf23) at 2015-03-28 02:51:45+00:00
[02:52:54] <morebots>	 Logged the message, Master
[03:00:40] <wikibugs>	 operations, HTTPS, HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1159087 (Tony_Tan_98) That's great to hear! Thanks.
[03:15:57] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]  
[03:27:38] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]  
[03:45:27] <grrrit-wm>	 (CR) 20after4: [C: 1] beta: Fix ::beta::autoupdater to work again [puppet] - https://gerrit.wikimedia.org/r/200248 (https://phabricator.wikimedia.org/T94261) (owner: BryanDavis)
[04:19:07] <grrrit-wm>	 (PS1) BryanDavis: monolog: MWLoggerMonologSamplingHandler -> Monolog\Handler\SamplingHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/200286
[04:32:32] <grrrit-wm>	 (Abandoned) 20after4: fix puppet error due to missing parent directory [puppet] - https://gerrit.wikimedia.org/r/198461 (owner: 20after4)
[04:34:50] <grrrit-wm>	 (CR) 20after4: "ok the reason I asked is that I'd +2 this but I don't want to break things when it gets merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: Reedy)
[04:35:27] <grrrit-wm>	 (CR) 20after4: "actually this cannot merge according to gerrit" [mediawiki-config] - https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: Reedy)
[04:36:44] <grrrit-wm>	 (CR) 20after4: "@nikerabbit: yeah I suppose it could" [mediawiki-config] - https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) (owner: Reedy)
[04:39:13] <grrrit-wm>	 (Abandoned) 20after4: Observe the remote IP reported by X_FORWARDED_FOR header from proxy server [puppet] - https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840) (owner: 20after4)
[04:57:16] <grrrit-wm>	 (PS3) BryanDavis: logstash: Ship logs via syslog udp datagrams [mediawiki-config] - https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732)
[04:57:31] <grrrit-wm>	 (CR) BryanDavis: "Rebased on I780b4fd02cb16b111dda33fe37c773f62c7c930f" [mediawiki-config] - https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis)
[05:01:57] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]  
[05:09:58] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail  
[05:11:47] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]  
[05:15:21] <grrrit-wm>	 (PS4) BryanDavis: logstash: Ship logs via syslog udp datagrams [mediawiki-config] - https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732)
[05:23:51] <grrrit-wm>	 (CR) BryanDavis: "I have made several changes based on conversations I had with Ori:" [mediawiki-config] - https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis)
[05:28:07] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[05:32:05] <wikibugs>	 operations, MediaWiki-extensions-Sentry, Multimedia, hardware-requests, Multimedia-Sprint-2015-03-25: Procure hardware for Sentry - placeholder (not a live request) - https://phabricator.wikimedia.org/T93138#1159149 (RobH) a:Tgr>RobH
[06:29:46] <icinga-wm>	 PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:29:47] <icinga-wm>	 PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:30:47] <icinga-wm>	 PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: puppet fail  
[06:30:57] <icinga-wm>	 PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:33:48] <icinga-wm>	 PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:35:28] <icinga-wm>	 PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:35:28] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:35:56] <icinga-wm>	 PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:36:09] <icinga-wm>	 PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:36:26] <icinga-wm>	 PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures  
[06:43:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[06:45:27] <icinga-wm>	 RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures  
[06:45:47] <icinga-wm>	 RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures  
[06:45:57] <icinga-wm>	 RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures  
[06:45:57] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]  
[06:46:17] <icinga-wm>	 RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[06:46:17] <icinga-wm>	 RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures  
[06:47:07] <icinga-wm>	 RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[06:47:07] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures  
[06:47:47] <icinga-wm>	 RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures  
[06:47:57] <icinga-wm>	 RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[06:48:57] <icinga-wm>	 RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[06:55:47] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]  
[07:28:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0  
[07:44:38] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1009 is CRITICAL: CRITICAL - load average: 275.08, 172.97, 82.44  
[07:53:02] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Mar 28 07:51:56 UTC 2015 (duration 51m 55s)
[07:53:12] <morebots>	 Logged the message, Master
[07:56:58] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]  
[08:08:37] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]  
[08:41:14] <grrrit-wm>	 (CR) Glaisher: Restore unregistered editing on mobile sites (staggered) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/198691 (https://phabricator.wikimedia.org/T93210) (owner: Nemo bis)
[08:45:11] <grrrit-wm>	 (CR) Glaisher: Restore unregistered editing on mobile sites (staggered) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/198691 (https://phabricator.wikimedia.org/T93210) (owner: Nemo bis)
[09:06:37] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:47] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:47] <icinga-wm>	 PROBLEM - Disk space on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:48] <icinga-wm>	 PROBLEM - dhclient process on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:57] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:57] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:06:57] <icinga-wm>	 PROBLEM - salt-minion processes on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:07] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:17] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:18] <icinga-wm>	 PROBLEM - puppet last run on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:27] <icinga-wm>	 PROBLEM - DPKG on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:36] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:37] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:47] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:57] <icinga-wm>	 PROBLEM - swift-account-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:07:57] <icinga-wm>	 PROBLEM - configured eth on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:08:07] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:08:08] <icinga-wm>	 PROBLEM - swift-container-server on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:08:16] <icinga-wm>	 PROBLEM - RAID on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[09:25:32] <_joe_>	 mh ms-be1009 again
[10:15:21] <wikibugs>	 operations: upload.wikimedia.org not loading 3/27/2015 - https://phabricator.wikimedia.org/T94269#1159219 (Aklapper) For future reference: https://www.mediawiki.org/wiki/How_to_report_a_bug
[10:17:37] <icinga-wm>	 PROBLEM - puppet last run on virt1011 is CRITICAL: CRITICAL: Puppet has 3 failures  
[10:34:16] <icinga-wm>	 RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures  
[10:45:32] <godog>	 !log powercycle ms-be1009
[10:45:37] <morebots>	 Logged the message, Master
[10:48:07] <icinga-wm>	 PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100%  
[10:48:36] <icinga-wm>	 RECOVERY - DPKG on ms-be1009 is OK: All packages OK  
[10:48:37] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be1009 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor  
[10:48:37] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater  
[10:48:37] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator  
[10:48:47] <icinga-wm>	 RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 2.66 ms  
[10:49:06] <icinga-wm>	 RECOVERY - configured eth on ms-be1009 is OK: NRPE: Unable to read output  
[10:49:06] <icinga-wm>	 RECOVERY - swift-account-server on ms-be1009 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server  
[10:49:17] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator  
[10:49:27] <icinga-wm>	 RECOVERY - swift-container-server on ms-be1009 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server  
[10:49:27] <icinga-wm>	 RECOVERY - RAID on ms-be1009 is OK: OK: optimal, 14 logical, 14 physical  
[10:49:27] <icinga-wm>	 RECOVERY - Disk space on ms-be1009 is OK: DISK OK  
[10:49:27] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator  
[10:49:27] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor  
[10:49:37] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1009 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server  
[10:49:37] <icinga-wm>	 RECOVERY - dhclient process on ms-be1009 is OK: PROCS OK: 0 processes with command name dhclient  
[10:49:47] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater  
[10:49:47] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper  
[10:49:47] <icinga-wm>	 RECOVERY - salt-minion processes on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion  
[10:49:47] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1009 is OK: OK - load average: 18.98, 7.88, 2.90  
[10:50:06] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor  
[10:58:19] <wikibugs>	 Blocked-on-Operations, operations, Phabricator, Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1159226 (mmodell) stalled>Resolved
[10:58:47] <wikibugs>	 operations, Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#991959 (mmodell)
[14:48:28] <icinga-wm>	 PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[14:56:47] <icinga-wm>	 RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.060 second response time  
[14:57:41] <andrewbogott>	 !log graceful’d apache2 on virt1000
[14:58:05] <andrewbogott>	 morebots, you there?
[14:58:11] <andrewbogott>	 hm
[14:58:54] <morebots>	 I am a logbot running on tools-exec-10.
[14:58:54] <morebots>	 Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[14:58:54] <morebots>	 To log a message, type !log <msg>.
[15:01:06] <andrewbogott>	 !log graceful’d apache2 on virt1000
[15:01:45] <andrewbogott>	 grrrrr
[15:02:23] <andrewbogott>	 morebots, what gives?
[15:02:37] <JohnFLewis>	 andrewbogott: restart morebots perhaps? (despite its response anyway)
[15:02:44] <andrewbogott>	 I just did
[15:02:56] <morebots>	 I am a logbot running on tools-exec-13.
[15:02:56] <morebots>	 Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[15:02:56] <morebots>	 To log a message, type !log <msg>.
[15:02:58] <JohnFLewis>	 Oh didn't see that
[15:03:27] <icinga-wm>	 PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures  
[15:03:46] <icinga-wm>	 PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[15:05:51] <JohnFLewis>	 andrewbogott: fyi, logins to Wikitech are timing out for me which may be why it fails
[15:06:04] <andrewbogott>	 yep
[15:06:31] <andrewbogott>	 JohnFLewis: is that better?
[15:06:41] <andrewbogott>	 !log graceful’d apache2 on virt1000
[15:06:46] <andrewbogott>	 !log and restarted keystone on virt1000
[15:06:48] <morebots>	 Logged the message, Master
[15:06:52] <JohnFLewis>	 Yeah
[15:06:53] <morebots>	 Logged the message, Master
[15:06:57] <icinga-wm>	 RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.066 second response time  
[15:07:05] <andrewbogott>	 so virt1000 oom’d over night :(
[15:08:10] <JohnFLewis>	 Seems like a regular thing with virt1000 :/ let's hope the new virt orders process quickly anyway
[15:09:45] <andrewbogott>	 new virt hosts won’t help with virt1000.  But https://phabricator.wikimedia.org/T90627 might
[15:10:30] <JohnFLewis>	 I thought the new virts were going to include a virt1000 replacement as well though?
[15:11:09] <wikibugs>	 operations, Labs: OOM on virt1000 - https://phabricator.wikimedia.org/T88256#1159347 (Andrew) This happened again last night.  Something must be running amok and gobbling memory.
[15:12:04] <andrewbogott>	 JohnFLewis: not necessarily.  It has plenty of memory already, something is running wild and eating it all
[15:12:20] <JohnFLewis>	 Right.
[15:12:27] <andrewbogott>	 maybe keystone has a leak.
[15:20:07] <icinga-wm>	 RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures  
[17:07:38] <icinga-wm>	 PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail  
[17:24:27] <icinga-wm>	 RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures  
[20:05:22] <wikibugs>	 operations, Labs, Monitoring, Patch-For-Review: Setup alarms for labstore* to check for network saturation - https://phabricator.wikimedia.org/T92629#1159525 (yuvipanda)
[20:18:46] <Krenair>	 mutante, in the admin data, is 'real name' a valid key?
[20:18:53] <Krenair>	 isn't it supposed to be just 'realname'?
[20:18:56] <Krenair>	 I noticed one user has it
[20:19:59] <YuviPanda>	 Coren: did a bunch of CR for your patches :)
[20:20:49] <hoo>	 Krenair:         comment    => $uinfo['realname'],
[20:21:01] <hoo>	 That's the only reference to any key with such a name
[20:25:08] <Krenair>	 am talking about https://git.wikimedia.org/blob/operations%2Fpuppet.git/production/modules%2Fadmin%2Fdata%2Fdata.yaml hoo
[20:25:45] <Krenair>	 or you mean, that's the only place it's used?
[20:27:54] <hoo>	 yes
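The 'real name' versus 'realname' question above amounts to checking each user entry for keys that nothing reads. A minimal Python sketch of such a check follows, working on an already-parsed mapping. The function name find_unexpected_keys, the allowed key set, and the sample users are all assumptions for illustration; only 'realname' is confirmed by the $uinfo['realname'] reference quoted above.

```python
# Assumed set of valid per-user keys; only "realname" is confirmed by the log.
ALLOWED_KEYS = {"ensure", "gid", "uid", "realname", "ssh_keys"}


def find_unexpected_keys(users, allowed=ALLOWED_KEYS):
    """Return {username: sorted list of keys that are not in the allowed set}."""
    problems = {}
    for name, info in users.items():
        extra = sorted(set(info) - set(allowed))
        if extra:
            problems[name] = extra
    return problems


# Hypothetical sample data, shaped like a parsed "users" section.
sample = {
    "alice": {"ensure": "present", "uid": 1001, "realname": "Alice Example"},
    "bob": {"ensure": "present", "uid": 1002, "real name": "Bob Example"},
}
print(find_unexpected_keys(sample))
```

With real data, the mapping would come from loading data.yaml with a YAML parser; that step is left out here so the sketch needs nothing beyond the standard library.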
[20:40:59] <YuviPanda>	 matanya: around?
[21:17:12] <hoo>	 Krenair: https://meta.wikimedia.org/wiki/System_administrators yay :) Thanks
[21:25:39] <Krenair>	 hoo, had to write a script to generate that from the yaml
[21:26:12] <Krenair>	 and another script to get the parsoid output of the existing page and take note of the existing data
[21:26:31] <Krenair>	 but it's now less ridiculous
[21:26:35] <Krenair>	 and includes all ops+deployment
[22:27:26] <icinga-wm>	 PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail  
[22:45:28] <icinga-wm>	 RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures  
[22:53:03] <wikibugs>	 operations, Continuous-Integration, Release-Engineering, Graphite, Upstream: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#1159684 (hashar) a:hashar>None
[22:59:53] <wikibugs>	 Ops-Access-Requests, operations: Checkuser and Sysop on Wikitech for Jalexander - https://phabricator.wikimedia.org/T94319#1159697 (Jalexander) NEW a:hoo
[23:02:44] <wikibugs>	 Ops-Access-Requests, operations: Checkuser and Sysop on Wikitech for Jalexander - https://phabricator.wikimedia.org/T94319#1159706 (Jalexander) (For the record I know that closing accounts on wikitech is much more complicated given the shell connection)
[23:04:14] <hoo>	 !log Gave sysop and checkuser to Jalexander@labswiki via shell from silver after doing it via meta failed. ([[phab:T94319|T94319]])
[23:04:23] <morebots>	 Logged the message, Master
[23:05:21] <wikibugs>	 Ops-Access-Requests, operations: Checkuser and Sysop on Wikitech for Jalexander - https://phabricator.wikimedia.org/T94319#1159707 (hoo) Open>Resolved Done.  For reference: It failed on meta:  ``` (Cannot access the database: Can't connect to MySQL server on '208.80.154.136' (4) (208.80.154.136)) ```
[23:05:50] <wikibugs>	 operations, Wikimedia-Site-requests: Checkuser and Sysop on Wikitech for Jalexander - https://phabricator.wikimedia.org/T94319#1159709 (hoo)
[23:11:19] <jgage>	 an ex-coworker showed me this lil "unicode/wikipedia mashup" demo he made -- displays a grid of randomly selected unicode glyphs, and displays wikipedia article content when you click on them: https://tranquil-forest-1441.herokuapp.com , https://github.com/siznax/charpoy
[23:12:26] <wikibugs>	 operations: upload.wikimedia.org not loading 3/27/2015 - https://phabricator.wikimedia.org/T94269#1159711 (SlayerFanatic1999) Close this report.
[23:30:11] <wikibugs>	 operations: upload.wikimedia.org not loading 3/27/2015 - https://phabricator.wikimedia.org/T94269#1159720 (Krenair) Yeah, I did already.
[23:35:54] <twentyafterfour>	 I'm not sure what's normally supposed to be there, but https://git.wikimedia.org/ is currently showing "Internal error"
[23:40:45] <jgage>	 hm, thanks. i tried restarting apache on antimony, but that hasn't solved the problem.
[23:41:04] <hoo|away>	 jgage: Restart gitblit itself
[23:42:09] <jgage>	 hmm ok
[23:42:46] <jgage>	 well now we get a different error message :)
[23:43:01] <Krenair>	 yes, that you can't even get to gitblit :p
[23:43:04] <hoo|away>	 that's gitblit restarting, I think
[23:45:57] <hoo|away>	 here we go :)
[23:46:23] <jgage>	 nice
[23:46:24] <jgage>	 thanks folks
[23:46:35] <jgage>	 it sure took its time starting up, but.. java
[23:47:10] <jgage>	 i'll open a ticket to create a monitor to catch this
[23:47:56] <hoo|away>	 +1
[23:54:23] <wikibugs>	 operations, Monitoring: Monitor https://git.wikimedia.org/ - https://phabricator.wikimedia.org/T94320#1159721 (Gage) NEW
[23:56:57] <jgage>	 hm i wish i'd checked the http response on that "internal error" message
[23:57:11] <jgage>	 guessing it was 200 because we already have an http monitor for that url
[23:57:29] <wikibugs>	 operations, Monitoring: Improve monitoring of https://git.wikimedia.org/ - https://phabricator.wikimedia.org/T94320#1159731 (Gage)
[23:58:31] <legoktm>	 I thought we had auto-restarting for gitblit?
[23:59:42] <legoktm>	 https://gerrit.wikimedia.org/r/#/c/188480/ abandoned 
[23:59:43] <jgage>	 the upstart config says respawn, but that would only be triggered if it exited. this time it seemed to have hung.
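If the "Internal error" page was indeed served with HTTP 200, as guessed above, a status-only check would stay green while the service was hung. A minimal Python sketch of a check that also inspects the response body follows. The function name evaluate_response and the marker tuple are hypothetical, and this is not the check that was actually deployed for T94320; the exit codes follow the usual Nagios/Icinga plugin convention.

```python
OK, CRITICAL = 0, 2  # Nagios/Icinga plugin exit codes


def evaluate_response(status_code, body, error_markers=("Internal error",)):
    """Return (exit_code, message) for a fetched page, looking at status and body."""
    if status_code != 200:
        return CRITICAL, f"CRITICAL: HTTP {status_code}"
    # A 200 alone is not enough: an error page can still come back with 200.
    for marker in error_markers:
        if marker in body:
            return CRITICAL, f"CRITICAL: HTTP 200 but body contains {marker!r}"
    return OK, "OK: HTTP 200 and no error marker in body"


print(evaluate_response(200, "<h1>Internal error</h1>"))
print(evaluate_response(200, "<title>Gitblit</title>"))
```

The status code and body would come from whichever HTTP client the check uses, ideally with a timeout so that a hung backend is reported rather than waited on.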