[00:00:00] <wikibugs>	 Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1656517 (Peachey88) Is deleting accounts something that is coming up enough that we need to expand access?
[00:02:44] <icinga-wm>	 RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.087 second response time
[00:48:31] <wikibugs>	 operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1656526 (Peachey88) >>! In T110949#1590860, @Jalexander wrote: > For the master password the only people I have ever known to have the master password outside of ops is Erik, Philippe and myself...
[01:37:43] <wikibugs>	 operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1656528 (Jalexander) That should work; thanks Daniel. Slight possibility that I'll need to contact you offline for a redo on the file (I had a computer crash recently, not 100% sure the key serv...
[02:19:43] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 12s)
[02:19:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:20:04] <icinga-wm>	 PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:22:53] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-20 02:22:53+00:00
[02:23:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:06] <icinga-wm>	 PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[02:32:46] <icinga-wm>	 RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[02:46:24] <icinga-wm>	 RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:50:15] <icinga-wm>	 PROBLEM - RAID on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:35] <icinga-wm>	 PROBLEM - puppet last run on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:16] <icinga-wm>	 PROBLEM - nutcracker process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] <icinga-wm>	 PROBLEM - configured eth on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] <icinga-wm>	 PROBLEM - salt-minion processes on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] <icinga-wm>	 PROBLEM - DPKG on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:04] <icinga-wm>	 PROBLEM - nutcracker port on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:05] <icinga-wm>	 PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6
[02:55:15] <icinga-wm>	 PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:45] <icinga-wm>	 RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures
[02:55:55] <icinga-wm>	 RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:56:15] <icinga-wm>	 RECOVERY - configured eth on mw1014 is OK: OK - interfaces up
[02:56:16] <icinga-wm>	 RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:56:25] <icinga-wm>	 RECOVERY - DPKG on mw1014 is OK: All packages OK
[02:56:35] <icinga-wm>	 RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212
[02:56:54] <icinga-wm>	 RECOVERY - Disk space on mw1014 is OK: DISK OK
[02:56:54] <icinga-wm>	 RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[02:57:05] <icinga-wm>	 RECOVERY - RAID on mw1014 is OK: OK: no RAID installed
[02:58:46] <krrrit-wm>	 (PS2) Tim Landscheidt: Tools: Replace references to tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387)
[03:03:55] <krrrit-wm>	 (CR) Tim Landscheidt: "Tested on Toolsbeta for the default of tools.wmflabs.org and changes via Hiera:Toolsbeta." [puppet] - https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[03:04:55] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:25:46] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:35:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:15] <icinga-wm>	 PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:54] <icinga-wm>	 RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:01:55] <icinga-wm>	 RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:04:34] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:18:24] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 18.52% of data above the critical threshold [100000000.0]
[04:29:16] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 20 04:29:16 UTC 2015 (duration 29m 15s)
[04:29:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:44:46] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:20:36] <icinga-wm>	 PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:47:04] <icinga-wm>	 RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:20:36] <icinga-wm>	 PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] <icinga-wm>	 PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:06] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:06] <icinga-wm>	 RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:04] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:05] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:05] <icinga-wm>	 RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:57:56] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:26] <paravoid>	 !log temporarily disabling puppet on fermium and applying antispam countermeasures
[07:24:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:31:07] <wikibugs>	 operations, HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1656628 (Chmarkine) I think this task can finally be closed as resolved, as there're no more domains that lack FS. (T91504 is now about DNSSEC.) https://wikitech.wikimedia.org/wiki/HTTPS/domains
[07:34:37] <krrrit-wm>	 (PS3) Faidon Liambotis: Replace Package['git-core'] with Package['git'] [puppet] - https://gerrit.wikimedia.org/r/233853
[07:34:39] <krrrit-wm>	 (PS3) Faidon Liambotis: Remove sodium from puppet (spare/decom) [puppet] - https://gerrit.wikimedia.org/r/239411 (https://phabricator.wikimedia.org/T110142) (owner: John F. Lewis)
[07:34:42] <krrrit-wm>	 (PS8) Faidon Liambotis: Remove support for Ubuntu Lucid/10.04 [puppet] - https://gerrit.wikimedia.org/r/179888
[07:45:54] <icinga-wm>	 PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: puppet fail
[07:46:55] <krrrit-wm>	 (PS1) Faidon Liambotis: mailman: apply spam countermeasures [puppet] - https://gerrit.wikimedia.org/r/239650
[07:47:26] <krrrit-wm>	 (CR) Faidon Liambotis: [C: 2 V: 2] mailman: apply spam countermeasures [puppet] - https://gerrit.wikimedia.org/r/239650 (owner: Faidon Liambotis)
[08:05:35] <icinga-wm>	 PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail
[08:14:07] <icinga-wm>	 RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:30:25] <icinga-wm>	 RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:34:25] <icinga-wm>	 PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail
[10:02:34] <icinga-wm>	 RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:35:46] <icinga-wm>	 PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: Connection refused
[12:36:26] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[12:53:55] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:33:44] <icinga-wm>	 RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[13:38:55] <icinga-wm>	 PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: Connection refused
[15:56:01] <wikibugs>	 operations, Traffic: cp1046 Varnish backend panic - https://phabricator.wikimedia.org/T113184#1656969 (faidon) NEW a:BBlack
[16:44:55] <bblack>	 !log depooling cp1046 varnish-be + varnish-be-rand in confctl, wiping storage, re-pooling - T113184
[16:44:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:50:25] <icinga-wm>	 RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[17:01:19] <bblack>	 !log repooling cp1046 varnish-be + varnish-be-rand in confctl, fresh storage, purge queue caught up - T113184
[17:01:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:05:27] <wikibugs>	 operations, Traffic: cp1046 Varnish backend panic - https://phabricator.wikimedia.org/T113184#1656997 (BBlack) Open>Resolved Almost certainly corrupt cache contents from earlier crash.  I depooled, nuked the cache, and repooled.  Looks ok now.  SAL: ``` 2015-09-20 17:01 bblack: repooling cp1046 varn...
[17:26:57] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[17:42:06] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[17:54:35] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:39:46] <wikibugs>	 operations, HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1657052 (JanZerebecki) Open>Resolved a:JanZerebecki Great. Thank you, all who worked on this!
[18:44:48] <wikibugs>	 operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1657059 (JanZerebecki) Seems everything from this ticket except DNSSEC and DANE is fixed. Does OTRS do its own SMTP?
[19:12:37] <wikibugs>	 Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1657072 (Aklapper) @Peachey88: Judge yourself :)  From a quick search: https://phabricator.wikimedia.org/T106100 https://phabricator.wikimedia.org/T105352 https://www.mediawiki.org/...
[19:36:15] <icinga-wm>	 PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0]
[19:59:15] <icinga-wm>	 RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[20:32:19] <krrrit-wm>	 (PS1) Alex Monk: Split langlist for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006)
[20:38:44] <wikibugs>	 operations, Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1657122 (hashar) Open>declined a:hashar Nobody working on Shinken and I found a workaround (scraping the page).
[20:46:22] <krrrit-wm>	 (CR) Alex Monk: [C: -1] "This doesn't work because of this part of getRealmSpecificFilename:" [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: Alex Monk)
[20:52:18] <krrrit-wm>	 (PS1) Florianschmidtwelzow: Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches [mediawiki-config] - https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306)
[20:53:36] <krrrit-wm>	 (CR) Florianschmidtwelzow: Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: Florianschmidtwelzow)
[21:01:56] <krrrit-wm>	 (PS2) Alex Monk: Split langlist for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006)
[21:01:58] <krrrit-wm>	 (PS1) Alex Monk: Make extension optional in getRealmSpecificFilename [mediawiki-config] - https://gerrit.wikimedia.org/r/239754
[21:02:25] <krrrit-wm>	 (CR) Alex Monk: "No idea why my -1 from PS1 has stuck" [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: Alex Monk)
[21:03:27] <Krenair>	 multiversion/MWRealm.sh:WMF_DATACENTER=pmtpa
[21:08:54] <icinga-wm>	 PROBLEM - Host lvs1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:28:53] <yuvipanda>	 uhm
[22:29:01] <yuvipanda>	 lvs1012 down not sure anyone ack'd it
[22:29:03] * yuvipanda sshes
[22:30:21] <yuvipanda>	 bblack: paravoid around?
[22:34:40] <yuvipanda>	 !log reload pybal on lvs1012
[22:34:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:34:47] <bblack>	 yuvipanda: don't mess with it
[22:34:49] <paravoid>	 I don't think lvs1002 is in prod
[22:34:50] <yuvipanda>	 oops
[22:34:54] <paravoid>	 er, 1012
[22:34:58] <paravoid>	 1002 is definitely in prod :)
[22:34:59] <yuvipanda>	 I did a pybal reaoad
[22:35:03] <yuvipanda>	 *reload
[22:35:19] <bblack>	 it's ok
[22:35:21] <yuvipanda>	 was stracing it and found too many file errors
[22:35:27] <yuvipanda>	 saved lsof and strace output before reloading
[22:35:36] <bblack>	 I had it downtime for a few days while messing with it, but the downtime expired today
[22:35:42] <bblack>	 I forgot to re-up it in icinga
[22:35:47] <yuvipanda>	 ah, ok :)
[22:36:01] <bblack>	 it's not in any prod use, it's going to get reinstalled again before we get anywhere near that :)
[22:36:02] <yuvipanda>	 hopefully the reload didn't mess anything up?
[22:36:04] <yuvipanda>	 ok
[22:36:06] <bblack>	 no
[22:36:11] <yuvipanda>	 ok :)
[22:37:15] <yuvipanda>	 also not sure why I didn't get paged the other day?
[22:37:18] <bblack>	 re-downtimed :)
[22:37:20] <yuvipanda>	 not sure what I was supposed to be paged for
[22:38:20] <yuvipanda>	 I've verified the phone number on icinag
[22:38:22] <yuvipanda>	 err
[22:38:22] <yuvipanda>	 icinga
[22:38:24] <yuvipanda>	 seems right
[22:38:42] <paravoid>	 is the timezone correct too?
[22:38:46] <bblack>	 it should've been pages for "LVS HTTP IPv3 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host"
[22:38:50] <bblack>	 err IPv4 :)
[22:39:26] <bblack>	 which looks a lot like the ones we've been ignoring lately that say "LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out"
[22:40:24] <yuvipanda>	 paravoid: yeah, says 'PST_awake_hours'
[22:40:45] <yuvipanda>	 which is correct enough
[22:40:53] <yuvipanda>	 bblack: ipv3?!
[22:40:54] <paravoid>	 seems the emails to vtext.com were sent and delivered there
[22:41:09] <paravoid>	 2015-09-19 04:01:57
[22:41:09] <paravoid>	 etc.
[22:41:19] <bblack>	 I had 3x pairs of those (CRIT -> OK within a couple minutes) at ~04:02 UTC, ~04:23 UTC, ~05:05 UTC
[22:41:25] <yuvipanda>	 hmm
[22:41:33] <yuvipanda>	 can one of you send me an sms?
[22:41:41] <bblack>	 sure
[22:41:54] <yuvipanda>	 thanks :)
[22:42:02] <yuvipanda>	 bblack: I can PM you the number if you'd like :)
[22:43:24] <bblack>	 sent to number from icinga config
[22:43:34] <yuvipanda>	 bblack: yup, got it
[22:43:36] <yuvipanda>	 I responded.
[22:44:14] <bblack>	 ack
[22:45:00] <yuvipanda>	 ok. so I guess that was vtext failing?
[22:45:42] <bblack>	 sounds like it
[22:47:31] <yuvipanda>	 ok!
[22:47:52] <paravoid>	 try sending one now?
[22:48:15] <paravoid>	 just did
[22:49:10] <yuvipanda>	 I don't have anything
[22:49:44] <yuvipanda>	 nothing still
[22:50:35] <yuvipanda>	 noooppppeeee
[23:04:50] <yuvipanda1>	 hmm
[23:04:52] <yuvipanda1>	 nice :)
[23:05:09] <yuvipanda>	 woah, didn't know I had an ipv6 address!
[23:05:49] <yuvipanda>	 I wonder if someone else can ping it
[23:06:19] <yuvipanda>	 ah nope
[23:06:23] <yuvipanda>	 'network is unreachable'
[23:06:32] <paravoid>	 64 bytes from 2601:642:4301:387a:6257:18ff:fe38:1675: icmp_seq=3 ttl=43 time=220 ms
[23:08:04] <yuvipanda1>	 oh
[23:08:06] <yuvipanda1>	 hmm
[23:08:09] <yuvipanda1>	 that's strange / interesting
[23:08:16] <yuvipanda1>	 ah, I guess labs doesn't do ipv6
[23:08:20] <yuvipanda1>	 which might explain it
[23:08:39] <yuvipanda1>	 so wait if I open up a port now you can just reach it without having to do any portmapping on the router?
[23:08:41] <yuvipanda1>	 niiice
[23:11:39] <yuvipanda1>	 I wonder how the mobile support is
[23:21:42] <krrrit-wm>	 (CR) Platonides: Allow 'block' AbuseFilterAction on eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[23:50:05] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]