[00:00:00] Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1656517 (Peachey88) Is deleting accounts something that is coming up enough that we need to expand access?
[00:02:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.087 second response time
[00:48:31] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1656526 (Peachey88) >>! In T110949#1590860, @Jalexander wrote: > For the master password the only people I have ever known to have the master password outside of ops is Erik, Philippe and myself...
[01:37:43] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1656528 (Jalexander) That should work; thanks Daniel. Slight possibility that I'll need to contact you offline for a redo on the file (I has a computer crash recently, not 100% sure the key serv...
[02:19:43] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 12s)
[02:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:20:04] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:22:53] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-20 02:22:53+00:00
[02:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:06] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[02:32:46] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[02:46:24] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:50:15] PROBLEM - RAID on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:35] PROBLEM - puppet last run on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:16] PROBLEM - nutcracker process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] PROBLEM - configured eth on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] PROBLEM - salt-minion processes on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:54:45] PROBLEM - DPKG on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:04] PROBLEM - nutcracker port on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:05] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6
[02:55:15] PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:45] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures
[02:55:55] RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:56:15] RECOVERY - configured eth on mw1014 is OK: OK - interfaces up
[02:56:16] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:56:25] RECOVERY - DPKG on mw1014 is OK: All packages OK
[02:56:35] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212
[02:56:54] RECOVERY - Disk space on mw1014 is OK: DISK OK
[02:56:54] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[02:57:05] RECOVERY - RAID on mw1014 is OK: OK: no RAID installed
[02:58:46] (PS2) Tim Landscheidt: Tools: Replace references to tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387)
[03:03:55] (CR) Tim Landscheidt: "Tested on Toolsbeta for the default of tools.wmflabs.org and changes via Hiera:Toolsbeta." [puppet] - https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[03:04:55] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:25:46] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:35:26] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:15] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:54] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:01:55] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:04:34] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:18:24] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 18.52% of data above the critical threshold [100000000.0]
[04:29:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 20 04:29:16 UTC 2015 (duration 29m 15s)
[04:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:44:46] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:20:36] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:47:04] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:20:36] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:06] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:06] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:04] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:05] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:05] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:57:56] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:26] !log temporarily disabling puppet on fermium and applying antispam countermeasures
[07:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:31:07] operations, HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1656628 (Chmarkine) I think this task can finally be closed as resolved, as there're no more domains that lack FS. (T91504 is now about DNSSEC.) https://wikitech.wikimedia.org/wiki/HTTPS/domains
[07:34:37] (PS3) Faidon Liambotis: Replace Package['git-core'] with Package['git'] [puppet] - https://gerrit.wikimedia.org/r/233853
[07:34:39] (PS3) Faidon Liambotis: Remove sodium from puppet (spare/decom) [puppet] - https://gerrit.wikimedia.org/r/239411 (https://phabricator.wikimedia.org/T110142) (owner: John F. Lewis)
[07:34:42] (PS8) Faidon Liambotis: Remove support for Ubuntu Lucid/10.04 [puppet] - https://gerrit.wikimedia.org/r/179888
[07:45:54] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: puppet fail
[07:46:55] (PS1) Faidon Liambotis: mailman: apply spam countermeasures [puppet] - https://gerrit.wikimedia.org/r/239650
[07:47:26] (CR) Faidon Liambotis: [C: 2 V: 2] mailman: apply spam countermeasures [puppet] - https://gerrit.wikimedia.org/r/239650 (owner: Faidon Liambotis)
[08:05:35] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail
[08:14:07] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:30:25] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:34:25] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail
[10:02:34] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:35:46] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: Connection refused
[12:36:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[12:53:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:33:44] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[13:38:55] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: Connection refused
[15:56:01] operations, Traffic: cp1046 Varnish backend panic - https://phabricator.wikimedia.org/T113184#1656969 (faidon) NEW a:BBlack
[16:44:55] !log depooling cp1046 varnish-be + varnish-be-rand in confctl, wiping storage, re-pooling - T113184
[16:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:50:25] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[17:01:19] !log repooling cp1046 varnish-be + varnish-be-rand in confctl, fresh storage, purge queue caught up - T113184
[17:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:05:27] operations, Traffic: cp1046 Varnish backend panic - https://phabricator.wikimedia.org/T113184#1656997 (BBlack) Open>Resolved Almost certainly corrupt cache contents from earlier crash. I depooled, nuked the cache, and repooled. Looks ok now. SAL: ``` 2015-09-20 17:01 bblack: repooling cp1046 varn...
[17:26:57] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[17:42:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[17:54:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:39:46] operations, HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1657052 (JanZerebecki) Open>Resolved a:JanZerebecki Great. Thank you, all who worked on this!
[18:44:48] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1657059 (JanZerebecki) Seems everything from this ticket except DNSSEC and DANE are fixed. Does otrs does its own SMTP?
[19:12:37] Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1657072 (Aklapper) @Peachey88: Judge yourself :) From a quick search: https://phabricator.wikimedia.org/T106100 https://phabricator.wikimedia.org/T105352 https://www.mediawiki.org/...
[19:36:15] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0]
[19:59:15] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[20:32:19] (PS1) Alex Monk: Split langlist for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006)
[20:38:44] operations, Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1657122 (hashar) Open>declined a:hashar Nobody working on Shinken and I found a workaround (scraping the page).
[20:46:22] (CR) Alex Monk: [C: -1] "This doesn't work because of this part of getRealmSpecificFilename:" [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: Alex Monk)
[20:52:18] (PS1) Florianschmidtwelzow: Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches [mediawiki-config] - https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306)
[20:53:36] (CR) Florianschmidtwelzow: Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: Florianschmidtwelzow)
[21:01:56] (PS2) Alex Monk: Split langlist for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006)
[21:01:58] (PS1) Alex Monk: Make extension optional in getRealmSpecificFilename [mediawiki-config] - https://gerrit.wikimedia.org/r/239754
[21:02:25] (CR) Alex Monk: "No idea why my -1 from PS1 has stuck" [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: Alex Monk)
[21:03:27] multiversion/MWRealm.sh:WMF_DATACENTER=pmtpa
[21:08:54] PROBLEM - Host lvs1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:28:53] uhm
[22:29:01] lvs1012 down not sure anyone ack'd it
[22:29:03] * yuvipanda sshes
[22:30:21] bblack: paravoid around?
[22:34:40] !log reloda pybal on lvs1012
[22:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:34:47] yuvipanda: don't mess with it
[22:34:49] I don't think lvs1002 is in prod
[22:34:50] oops
[22:34:54] er, 1012
[22:34:58] 1002 is definitely in prod :)
[22:34:59] I did a pybal reaoad
[22:35:03] *reload
[22:35:19] it's ok
[22:35:21] was stracing it and found too many file errors
[22:35:27] saved lsof and strace output before reloading
[22:35:36] I had it downtime for a few days while messing with it, but the downtime expired today
[22:35:42] I forgot to re-up it in icinga
[22:35:47] ah, ok :)
[22:36:01] it's not in any prod use, it's going to get reinstalled again before we get anywhere near that :)
[22:36:02] hopefully the reload didn't mess anything up?
[22:36:04] ok
[22:36:06] no
[22:36:11] ok :)
[22:37:15] also not sure why I didn't get paged the other day?
[22:37:18] re-downtimed :)
[22:37:20] not sure what I was supposed to be paged for
[22:38:20] I've verified the phone number on icinag
[22:38:22] err
[22:38:22] icinga
[22:38:24] seems right
[22:38:42] is the timezone correct too?
[22:38:46] it should've been pages for "LVS HTTP IPv3 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host"
[22:38:50] err IPv4 :)
[22:39:26] which looks a lot like the ones we've been ignoring lately that say "LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out"
[22:40:24] paravoid: yeah, says 'PST_awake_hours'
[22:40:45] which is correct enough
[22:40:53] bblack: ipv3?!
[22:40:54] seems the emails to vtext.com were sent and delivered there
[22:41:09] 2015-09-19 04:01:57
[22:41:09] etc.
[22:41:19] I had 3x pairs of those (CRIT -> OK within a couple minutes) at ~04:02 UTC, ~04:23 UTC, ~05:05 UTC
[22:41:25] hmm
[22:41:33] can one of you send me an sms?
[22:41:41] sure
[22:41:54] thanks :)
[22:42:02] bblack: I can PM you the number if you'd like :)
[22:43:24] sent to number from icinga config
[22:43:34] bblack: yup, got it
[22:43:36] I responded.
[22:44:14] ack
[22:45:00] ok. so I guess that was vtext failing?
[22:45:42] sounds like it
[22:47:31] ok!
[22:47:52] try sending one now?
[22:48:15] just did
[22:49:10] I don't have anything
[22:49:44] nothing still
[22:50:35] noooppppeeee
[23:04:50] hmm
[23:04:52] nice :)
[23:05:09] woah, didn't know I had an ipv6 address!
[23:05:49] I wonder if someone else can ping it
[23:06:19] ah nope
[23:06:23] 'network is unreachable'
[23:06:32] 64 bytes from 2601:642:4301:387a:6257:18ff:fe38:1675: icmp_seq=3 ttl=43 time=220 ms
[23:08:04] oh
[23:08:06] hmm
[23:08:09] that's strange / interesting
[23:08:16] ah, I guess labs doesn't do ipv6
[23:08:20] which might explain it
[23:08:39] so wait if I open up a port now you can just reach it without having to do any portmapping on the router?
[23:08:41] niiice
[23:11:39] I wonder how the mobile support is
[23:21:42] (CR) Platonides: Allow 'block' AbuseFilterAction on eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[23:50:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]