[00:01:04] <icinga-wm>	 PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[00:02:33] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716899 (10Dzahn) now in icinga config we can see how our new contact...
[00:20:29] <krrrit-wm>	 (03CR) 10: [C: 031] "Per https://phabricator.wikimedia.org/T110619#1714416" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio)
[00:23:19] <krrrit-wm>	 (03CR) 10: "There are redirects right now for those urls, but this should be fixed for real here. (Also, using this comment as a test of what is going" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[00:23:27] <greg-g>	 heh
[00:24:57] <krrrit-wm>	 (03CR) 10Greg Grossmeier: "(And another, sorry for the noise)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[00:24:59] <greg-g>	 yay
[00:25:04] <greg-g>	 legoktm: fixed it
[00:26:14] <icinga-wm>	 RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[00:26:38] <yuvipanda>	 greg-g: you were batman
[00:27:22] <krrrit-wm>	 (03PS1) 10EBernhardson: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 
[00:29:40] <greg-g>	 yuvipanda: legoktm I had to click 'reload' here: http://i.imgur.com/hSxiM6j.png  (I think it broke after I added another identity, to test that recent bug report about it)
[00:30:39] <yuvipanda>	 ah
[00:30:41] <yuvipanda>	 nice
[00:30:46] <krrrit-wm>	 (03PS2) 10EBernhardson: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 
[00:35:31] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[00:43:09] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716953 (10Dzahn) And finally I added `can_submit_commands   1` to th...
[00:46:57] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1716958 (10AKoval_WMF) Thanks @JohnLewis. We definitely don't want to delete archives! :) Glad I've got the terminology straight now.   To clarify which list order we want...
[00:47:06] <krrrit-wm>	 (03PS2) 10Yuvipanda: puppet: Have a 'secret' repository for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) 
[00:47:43] <krrrit-wm>	 (03PS3) 10Yuvipanda: puppet: Have a 'secret' repository for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) 
[00:49:12] <icinga-wm>	 PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.001 second response time
[00:49:32] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.009 second response time
[00:50:52] <icinga-wm>	 RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4450 bytes in 0.002 second response time
[00:51:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4450 bytes in 0.002 second response time
[00:51:39] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716959 (10Smalyshev) Checked and now I can control notifications for...
[00:51:45] <mutante>	 that was a test for notifications for SMalyshev ^
[00:52:10] <SMalyshev>	 yep seems to work well, thanks!
[00:54:15] <krrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] "Tested" [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda)
[00:54:58] <mutante>	 SMalyshev: the states for services, besides critical and recovery, are: w(arning), f(lapping) and u(nknown)
[00:56:56] <mutante>	 and for hosts there is u(nreachable)  and s to get notified when scheduled downtimes start and end
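
Note: the option letters mutante lists above can be spelled out in a small lookup. This is only a minimal sketch, assuming the standard Icinga 1 / Nagios contact options (service: w,u,c,r,f,s,n; host: d,u,r,f,s,n). The helper name `expand_options` and both tables are hypothetical, made up for illustration, and are not part of Icinga or of the Wikimedia puppet repository.

```python
# Hypothetical lookup tables for illustration; the letters follow the
# standard Icinga 1 / Nagios contact notification options.
SERVICE_OPTIONS = {
    "w": "warning",
    "u": "unknown",
    "c": "critical",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
    "n": "none",
}

HOST_OPTIONS = {
    "d": "down",
    "u": "unreachable",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
    "n": "none",
}


def expand_options(letters, table):
    """Turn a comma-separated option string such as 'w,c,r' into state names."""
    names = []
    for letter in letters.split(","):
        letter = letter.strip()
        if letter not in table:
            raise ValueError(f"unknown notification option: {letter!r}")
        names.append(table[letter])
    return names
```

For example, `expand_options("w,c,r", SERVICE_OPTIONS)` spells out a service option string, and the same letter "u" resolves to "unknown" for services but "unreachable" for hosts, which is the distinction made in the two messages above.
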
[01:00:47] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716963 (10Dzahn) 5Open>3Resolved
[01:00:57] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1599052 (10Dzahn)
[01:02:00] <wikibugs>	 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1599052 (10Dzahn) added "w" to service_notification_options and "u" to host_notification_o...
[01:09:09] <krrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "It looks like you missed loads..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher)
[01:18:07] <krrrit-wm>	 (03PS1) 10Yuvipanda: k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 
[01:18:27] <krrrit-wm>	 (03PS2) 10Yuvipanda: k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 
[01:20:15] <krrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 (owner: 10Yuvipanda)
[01:30:17] <krrrit-wm>	 (03PS1) 10Yuvipanda: k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 
[01:30:20] <krrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 (owner: 10Yuvipanda)
[01:30:47] <krrrit-wm>	 (03PS2) 10Yuvipanda: k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 
[01:31:32] <krrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 (owner: 10Yuvipanda)
[01:33:10] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:37:52] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1717004 (10Selsharbaty-WMF) @JohnLewis, Thanks for clarifying. Things are much clearer now.   Can we create the new list with the same name and mailing address of the one...
[01:50:29] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717006 (10Dzahn)
[01:52:39] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717008 (10Dzahn) renamed ticket to clarify we had agreed on "only limited to services the user has access to" vs. full access and it was stalled...
[01:52:47] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717010 (10Dzahn) 5stalled>3Open
[01:56:11] <krrrit-wm>	 (03PS1) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) 
[01:59:09] <krrrit-wm>	 (03PS2) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) 
[02:00:19] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:03:36] <krrrit-wm>	 (03PS1) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) 
[02:04:53] <krrrit-wm>	 (03PS2) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) 
[02:05:18] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[02:05:29] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:06:02] <krrrit-wm>	 (03CR) 10Dzahn: [C: 032] "also enables email notification for existing service admin" [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn)
[02:06:06] <krrrit-wm>	 (03PS3) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) 
[02:08:58] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:10:11] <krrrit-wm>	 (03PS1) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 
[02:10:14] <krrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda)
[02:10:43] <krrrit-wm>	 (03PS2) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 
[02:10:57] <krrrit-wm>	 (03PS1) 10Dzahn: lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) 
[02:11:46] <krrrit-wm>	 (03PS2) 10Dzahn: lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) 
[02:13:59] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:14:45] <krrrit-wm>	 (03PS3) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 
[02:14:49] <krrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda)
[02:15:10] <krrrit-wm>	 (03CR) 10Dzahn: [C: 032] lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn)
[02:15:20] <krrrit-wm>	 (03PS4) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 
[02:16:58] <krrrit-wm>	 (03PS1) 10Dzahn: lists: move admin group from node to role [puppet] - 10https://gerrit.wikimedia.org/r/244846 
[02:19:33] <krrrit-wm>	 (03PS2) 10Dzahn: lists: move admin group from node to role [puppet] - 10https://gerrit.wikimedia.org/r/244846 
[02:19:42] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 05m 58s)
[02:19:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:20:33] <krrrit-wm>	 (03CR) 10Dzahn: [C: 032] "no change - http://puppet-compiler.wmflabs.org/988/fermium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/244846 (owner: 10Dzahn)
[02:20:58] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[02:22:18] <icinga-wm>	 ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors daniel_zahn adding missing contact
[02:22:30] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-10 02:22:30+00:00
[02:22:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:05] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:30:47] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717047 (10Dzahn) {P2182}
[02:31:25] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[02:33:18] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717048 (10Dzahn) following the same pattern as in T111243, this is now resolved.   email notifications and access to send c...
[02:33:43] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717049 (10Dzahn) 5Open>3Resolved
[02:33:56] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1439115 (10Dzahn)
[02:36:46] <krrrit-wm>	 (03PS1) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 
[02:37:07] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[02:43:49] <wikibugs>	 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1717052 (10Dzahn) p:5Normal>3Low
[02:44:21] <wikibugs>	 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1717054 (10Dzahn) 5Open>3stalled
[02:45:18] <yuvipanda>	 hmmm
[02:45:22] <yuvipanda>	 it's timing out connecting to gerrit
[02:45:47] <yuvipanda>	 the secret is fine
[02:58:32] <mutante>	 yuvipanda: gerrit server also has ferm now
[02:58:36] <mutante>	 maybe that
[02:58:44] <yuvipanda>	 mutante: nope
[02:58:49] <yuvipanda>	 it's a direct effect of me playing with k8s
[02:58:53] <mutante>	 ok
[02:58:56] <yuvipanda>	 and missing a 'notify' in puppet
[02:58:57] <yuvipanda>	 fixing it now
[02:59:04] <yuvipanda>	 this is one of the reasons I'm starting to hate my sleep patterns more
[02:59:12] <mutante>	 heh. ok. then i'll say good night and cu on an island
[02:59:14] <yuvipanda>	 if I were to have woken up like this in India, I'd have all of europe again
[02:59:17] <yuvipanda>	 now I've nobody
[02:59:20] <yuvipanda>	 mutante: are you also flying tomorrow?
[02:59:24] <yuvipanda>	 am I the only person flying on sunday...
[02:59:28] <mutante>	 no Sunday morning 
[02:59:34] <mutante>	 but early
[03:00:01] <mutante>	 and San Jose airport out
[03:00:38] <yuvipanda>	 mutante: ah...
[03:00:40] <yuvipanda>	 ok
[03:00:42] <yuvipanda>	 I'm also early sunday morning
[03:00:46] <yuvipanda>	 which means no sleeping on saturday.
[03:00:48] <yuvipanda>	 booooooo
[03:00:54] <yuvipanda>	 wait how am I even going to get to the airport
[03:01:13] <mutante>	 of course uber because you'll be too late for bart?
[03:01:17] <yuvipanda>	 too early
[03:01:19] <yuvipanda>	 yeah
[03:01:24] <mutante>	 or airport shuttle
[03:01:26] <mutante>	 and order it now 
[03:01:33] <mutante>	 to pick you up
[03:01:44] <yuvipanda>	 I'm looking at my ticket to just make sure
[03:01:45] <mutante>	 those blue and yellow small busses
[03:01:50] <yuvipanda>	 the year/month/date is throwing me off
[03:01:53] <yuvipanda>	 since it says 2015/10/11
[03:01:57] <yuvipanda>	 but tomorrow is the 10th
[03:02:00] <yuvipanda>	 err
[03:02:02] <yuvipanda>	 10/11/2015
[03:02:05] <mutante>	 or you do it like Leslie would have and bike to airport
[03:02:44] <mutante>	 im also wondering, since i fly out a different airport than i come back
[03:03:03] <mutante>	 it would not even make sense to park a car if it was free and not hundreds of dollars :p
[03:03:53] <mutante>	 anyways.. and apropos sleep patterns .. time for a break from laptop :)
[03:03:56] <mutante>	 cu
[03:04:02] <krrrit-wm1>	 (03PS4) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 
[03:04:02] <yuvipanda>	 bye!
[03:06:22] <legoktm>	 greg-g: yay
[03:07:46] <krrrit-wm1>	 (03PS1) 10Yuvipanda: dynamicproxy: Do not depend on labsdebrepo! [puppet] - 10https://gerrit.wikimedia.org/r/244852 
[03:08:51] <krrrit-wm1>	 (03PS1) 10Yuvipanda: labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 
[03:09:14] <krrrit-wm1>	 (03PS2) 10Yuvipanda: labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 
[03:09:16] <krrrit-wm1>	 (03PS2) 10Yuvipanda: dynamicproxy: Do not depend on labsdebrepo! [puppet] - 10https://gerrit.wikimedia.org/r/244852 
[03:12:26] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Do not depend on labsdebrepo! [puppet] - 10https://gerrit.wikimedia.org/r/244852 (owner: 10Yuvipanda)
[03:12:37] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 (owner: 10Yuvipanda)
[03:21:15] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[03:22:55] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[04:16:51] <krrrit-wm1>	 (03PS4) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) 
[04:17:25] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032 V: 032] hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda)
[04:24:19] <krrrit-wm1>	 (03PS6) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 
[04:24:36] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda)
[04:24:52] <krrrit-wm1>	 (03PS5) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 
[04:25:03] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 (owner: 10Yuvipanda)
[05:06:15] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[05:30:16] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[05:35:35] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[05:47:23] <krrrit-wm1>	 (03PS1) 10Yuvipanda: k8s: Fix kube2proxy to be nicer [puppet] - 10https://gerrit.wikimedia.org/r/244857 
[05:48:20] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] k8s: Fix kube2proxy to be nicer [puppet] - 10https://gerrit.wikimedia.org/r/244857 (owner: 10Yuvipanda)
[05:54:06] <krrrit-wm1>	 (03PS1) 10Yuvipanda: k8s: Setup ssl certs to be owned by kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/244858 
[05:54:58] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] k8s: Setup ssl certs to be owned by kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/244858 (owner: 10Yuvipanda)
[05:57:11] <krrrit-wm1>	 (03PS1) 10Yuvipanda: k8s: Remove extra k8s::ssl include [puppet] - 10https://gerrit.wikimedia.org/r/244859 
[05:58:01] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] k8s: Remove extra k8s::ssl include [puppet] - 10https://gerrit.wikimedia.org/r/244859 (owner: 10Yuvipanda)
[06:02:05] <krrrit-wm1>	 (03PS1) 10Yuvipanda: k8s: Fix path to ssl cert path for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244860 
[06:03:51] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] k8s: Fix path to ssl cert path for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244860 (owner: 10Yuvipanda)
[06:10:04] <krrrit-wm1>	 (03PS1) 10Yuvipanda: k8s: Add support for infrastructure-readonly accounts [puppet] - 10https://gerrit.wikimedia.org/r/244861 
[06:10:47] <krrrit-wm1>	 (03PS1) 10Yuvipanda: tools: Use the proxy-infrastructure account for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244862 
[06:11:12] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] k8s: Add support for infrastructure-readonly accounts [puppet] - 10https://gerrit.wikimedia.org/r/244861 (owner: 10Yuvipanda)
[06:11:24] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Use the proxy-infrastructure account for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244862 (owner: 10Yuvipanda)
[06:17:38] <wikibugs>	 6operations, 6Labs, 10Labs-Team-Backlog: Make sure that the 'secret' repo in self hosted puppetmasters is back-upable - https://phabricator.wikimedia.org/T115177#1717146 (10yuvipanda)
[06:23:58] <krrrit-wm1>	 (03PS1) 10Yuvipanda: aptly: Add ability to mark a repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244863 (https://phabricator.wikimedia.org/T112699) 
[06:24:53] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] aptly: Add ability to mark a repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244863 (https://phabricator.wikimedia.org/T112699) (owner: 10Yuvipanda)
[06:26:11] <krrrit-wm1>	 (03PS1) 10Yuvipanda: aptly: Mark aptly repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244864 (https://phabricator.wikimedia.org/T112699) 
[06:29:46] <icinga-wm>	 PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:25] <icinga-wm>	 PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:26] <icinga-wm>	 PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:35] <icinga-wm>	 PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:46] <icinga-wm>	 PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:26] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] <icinga-wm>	 PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:55] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:33:34] <krrrit-wm1>	 (03CR) 10Yuvipanda: [C: 032] aptly: Mark aptly repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244864 (https://phabricator.wikimedia.org/T112699) (owner: 10Yuvipanda)
[06:40:04] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 10 06:40:04 UTC 2015 (duration 40m 3s)
[06:40:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:49:40] <krrrit-wm1>	 (03CR) 10MZMcBride: "Related: <https://gerrit.wikimedia.org/r/#/c/244727> and <https://gerrit.wikimedia.org/r/#/c/244728>." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175007 (owner: 10Ori.livneh)
[06:54:03] <krrrit-wm1>	 (03CR) 10Hashar: [C: 031] "It is magic! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/244814 (owner: 10John F. Lewis)
[06:56:36] <icinga-wm>	 RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:56:46] <icinga-wm>	 RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:46] <icinga-wm>	 RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:05] <icinga-wm>	 RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:26] <icinga-wm>	 RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:45] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:47] <icinga-wm>	 RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:47] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:15:00] <krrrit-wm1>	 (03CR) 10Glaisher: "Which entries are you referring to? I just checked again and there's no such entry. All the commonsuploads wikis which are now specified h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher)
[09:03:06] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[09:04:45] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[09:35:26] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[09:56:46] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[10:00:17] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[10:35:53] <wikibugs>	 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717266 (10JohnLewis) 5Resolved>3Open Tried to test this by un silencing the mailman queue check (permissions noted above are correct) yet I...
[11:32:07] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:40:44] <krrrit-wm1>	 (03CR) 10TTO: "@Krenair, I'm not sure that's a good idea. Remember that labswiki has no restrictions on account creation; doing that would open the door " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[11:43:23] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 04-1] "Yeah, I don't think you understand the code you're changing here. It's already not including the meta blacklist." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[11:46:14] <krrrit-wm1>	 (03CR) 10TTO: "How so? $wgTitleBlacklistUsernameSources only affects username creation, not page creation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[11:46:54] <krrrit-wm1>	 (03CR) 10TTO: "> Yeah, I don't think you understand the code you're changing here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[11:50:21] <krrrit-wm1>	 (03CR) 10Steinsplitter: "As far as I know, the global title blacklist can be locally whitelisted, so this patch is not needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[11:52:02] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 031] "My mistake, I think this should be fine actually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher)
[11:55:23] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 031] Remove duplicate entries from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244435 (owner: 10Glaisher)
[11:56:12] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 031] Fix nbwiki to nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244736 (owner: 10Amire80)
[11:57:54] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 031] Use new page name for wmf release notes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[11:59:44] <krrrit-wm1>	 (03CR) 10Alex Monk: "Needs rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 (owner: 10TTO)
[12:01:12] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 04-1] "currently open dependency, getting this out of the review queue until it's ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher)
[12:02:02] <krrrit-wm1>	 (03PS2) 10TTO: Revert "Route Bug40009 logs to fluorine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 
[12:05:56] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[12:20:05] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[12:21:45] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[12:43:36] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0]
[12:47:37] <krrrit-wm1>	 (03CR) 10Alex Monk: [C: 031] "Files in tin:/srv/mediawiki-staging/wmf-config/.svn/tmp/ show this was once used to determine whether to require( $IP.'/extensions/Cite/Sp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243921 (owner: 10Glaisher)
[12:55:16] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:33:35] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:05:25] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[14:18:09] <krrrit-wm1>	 (03PS1) 10Yurik: tilerator should not expose admin UI [puppet] - 10https://gerrit.wikimedia.org/r/244884 
[14:28:28] <icinga-wm>	 PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:32:16] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:39:15] <icinga-wm>	 PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:53:46] <icinga-wm>	 RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[15:01:44] <krrrit-wm1>	 (03CR) 10MZMcBride: "Why not just uninstall the TitleBlacklist and SpamBlacklist extensions from private and fishbowl wikis?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[15:07:55] <icinga-wm>	 RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:14:15] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[15:15:49] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 3.005 second response time on port 9042
[15:19:46] <icinga-wm>	 PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail
[15:24:26] <icinga-wm>	 PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:36:27] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[15:41:52] <krrrit-wm1>	 (CR) BBlack: IdleConnection: set keepalive (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: Ori.livneh)
[15:43:26] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[15:46:46] <icinga-wm>	 RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[15:46:46] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[15:51:16] <icinga-wm>	 RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[15:57:50] <wikibugs>	 operations, Phabricator, Release-Engineering-Team, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1717447 (mmodell) @chasemp: Is there anything remaining for this to be completed? Feel free to claim and close this task. :)
[16:01:46] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[16:10:26] <icinga-wm>	 PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail
[16:35:36] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[16:36:02] <wikibugs>	 operations, Wikimedia-DNS, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1717459 (MZMcBride) The fundamental approach here seems to be flawed. From reading this task, it seems like we broadly have two options:  *...
[16:39:15] <icinga-wm>	 RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:47:56] <wikibugs>	 operations, Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1717540 (ssastry) Open>Resolved a:ssastry The main task of investigating this is done -- we have found...
[18:01:46] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[18:43:37] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[18:45:16] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[18:56:35] <yuvipanda>	 ori: ^ Since you asked to be poked about nutcracker on silver things :D
[19:05:36] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[19:07:06] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[19:12:06] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[19:45:55] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[19:52:56] <icinga-wm>	 PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: puppet fail
[20:02:47] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[20:21:26] <icinga-wm>	 RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[20:25:25] <krrrit-wm1>	 (Abandoned) Tim Landscheidt: WIP: Tools: Deploy local package management key [puppet] - https://gerrit.wikimedia.org/r/240021 (https://phabricator.wikimedia.org/T112699) (owner: Tim Landscheidt)
[20:35:56] <icinga-wm>	 PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: puppet fail
[20:36:26] <krrrit-wm1>	 (PS1) Gerrit Patch Uploader: Bug: T114930 Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[20:36:29] <krrrit-wm1>	 (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[20:36:34] <krrrit-wm1>	 (CR) jenkins-bot: [V: -1] Bug: T114930 Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:00:15] <krrrit-wm1>	 (CR) Luke081515: "(Need to upload another patch, but have some problems with the upload, so could take a while)" [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:04:35] <icinga-wm>	 RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[21:13:16] <krrrit-wm1>	 (PS2) Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[21:13:18] <krrrit-wm1>	 (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:13:20] <krrrit-wm1>	 (CR) jenkins-bot: [V: -1] Bug: T114930 [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:15:35] <krrrit-wm1>	 (PS3) Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[21:15:37] <krrrit-wm1>	 (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:15:39] <krrrit-wm1>	 (CR) jenkins-bot: [V: -1] Bug: T114930 [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:24:19] <wikibugs>	 operations, Labs: 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#1717673 (Krenair) NEW
[21:26:02] <wikibugs>	 operations, Labs: RDNS for 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#1717680 (yuvipanda)
[21:30:51] <krrrit-wm1>	 (PS4) Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930)
[21:30:53] <krrrit-wm1>	 (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: Gerrit Patch Uploader)
[21:31:59] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[21:33:12] <krrrit-wm1>	 (PS5) Luke081515: Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - https://gerrit.wikimedia.org/r/244896 (owner: Gerrit Patch Uploader)
[21:45:08] <krrrit-wm1>	 (CR) Ori.livneh: [C: -1] "Every change to metric processing, even if it leads to greater accuracy, makes it harder to compare current data with historic data and th" [puppet] - https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[21:47:37] <krrrit-wm1>	 (PS3) Ori.livneh: webperf: Allow zero values for navtiming metrics [puppet] - https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[21:51:19] <krrrit-wm1>	 (PS4) Ori.livneh: webperf: Allow zero values for navtiming metrics [puppet] - https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[21:52:05] <krrrit-wm1>	 (CR) Ori.livneh: [C: 2] webperf: Allow zero values for navtiming metrics [puppet] - https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[21:59:47] <icinga-wm>	 PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[22:01:14] <Reedy>	 ori: ^ lol
[22:01:36] <icinga-wm>	 RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[22:02:56] <icinga-wm>	 PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[22:03:45] <icinga-wm>	 PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:03:51] <icinga-wm>	 PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:03:59] <icinga-wm>	 PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:04:05] <icinga-wm>	 PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100%
[22:04:05] <_joe_>	 oh shit
[22:04:10] <icinga-wm>	 PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100%
[22:04:27] <robh>	 indeed, oh shit, seems like frack
[22:04:39] <robh>	 they are all frack hosts
[22:04:51] <robh>	 i'll text jeff green (he isnt going to our ops meeting)
[22:04:51] <_joe_>	 robh: I have no access to frack
[22:04:57] <icinga-wm>	 PROBLEM - carbon-cache write error on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [8.0]
[22:04:58] <_joe_>	 he is not?
[22:05:01] <robh>	 this seems like a pfw is down
[22:05:06] <_joe_>	 yep
[22:05:08] <robh>	 he isnt on the sheet for travel iirc, checking
[22:05:21] <_joe_>	 anyways, page him :)
[22:05:25] <icinga-wm>	 PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:05:31] <icinga-wm>	 PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[22:06:20] <icinga-wm>	 PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[22:06:27] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[22:06:37] <robh>	 i have done so, i can login to the pfw stack
[22:06:50] <robh>	 and now it just went away.....
[22:06:59] <robh>	 (it was responsive and now just died out on me)
[22:07:08] <robh>	 _joe_: is paravoid near you ;]
[22:07:14] <robh>	 (or you traveling tomorrow?)
[22:07:25] <_joe_>	 robh: I'm travelling tomorrow
[22:07:36] <robh>	 ok, im looking up his # to sms him now
[22:08:02] <akosiaris>	 I am around as well
[22:08:08] <akosiaris>	 pfw problems ?
[22:08:48] <icinga-wm>	 RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms
[22:08:51] <robh>	 it seems so we just lost hosts in frack...
[22:08:53] <robh>	 but now one came back
[22:09:18] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:09:28] <icinga-wm>	 RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms
[22:09:34] <icinga-wm>	 RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms
[22:09:41] <icinga-wm>	 RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 34.84 ms
[22:09:49] <robh>	 faidon is 20minutes from hotel
[22:10:00] <icinga-wm>	 RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 34.91 ms
[22:10:00] <akosiaris>	 10:09PM  up 8 mins, 0 users, load averages: 0.05, 0.43, 0.32
[22:10:10] <icinga-wm>	 RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms
[22:10:11] <akosiaris>	 that's pfw1-codfw
[22:10:14] <akosiaris>	 that would explain it
[22:10:30] <icinga-wm>	 RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 35.02 ms
[22:10:36] <icinga-wm>	 RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms
[22:10:44] <robh>	 so why the heck did it reboot =[
[22:10:59] <akosiaris>	 looking now
[22:11:24] <icinga-wm>	 RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[22:12:22] <akosiaris>	 Oct 10 22:12:41  pfw-codfw eventd[1089]: SYSTEM_ABNORMAL_SHUTDOWN: System abnormally shut down
[22:12:31] <akosiaris>	 power ?
[22:12:49] <robh>	 lemme login to the power strip and check log
[22:13:53] <krrrit-wm>	 (CR) Luke081515: [C: 1] Modify timezone for cswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/244649 (https://phabricator.wikimedia.org/T115048) (owner: Revi)
[22:14:20] <robh>	 mehhhhhh what logssss
[22:14:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:14:31] <robh>	 so no email notices that the pdu went down but it doesnt have local logging.
[22:14:48] <robh>	 (still checking to ensure thats right but nothing in web gui, checking ssh)
[22:16:42] <robh>	 ok, there is a log in the ssh
[22:16:56] <robh>	 but no recent power loss events logged.
[22:17:37] <chasemp>	 I don't have the new fangled frack access either
[22:17:53] <robh>	 also we dont use per outlet switched except in the network racks
[22:18:01] <robh>	 so i cannot check for per outlet logging, only infeed overall 
[22:18:16] <robh>	 i also texted jeff so he should be aware of the issue
[22:19:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:19:23] <chasemp>	 we may want to call him
[22:19:35] <chasemp>	 unless he responded to the text
[22:19:54] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:20:25] <mark>	 Do call
[22:20:30] <akosiaris>	 pfw did log some high cpu usage alerts this day and the previous one but last one was 10 hours before the incident
[22:20:34] <akosiaris>	 I doubt it's related
[22:20:37] <robh>	 I'll call now since he hasnt responded
[22:22:08] <robh>	 he did not pick up, i left a voicemail
[22:22:14] <krrrit-wm>	 (CR) TTO: "@MZMcBride: The SpamBlacklist extension is already uninstalled from privates and fishbowls. Notice that the request in the task was only t" [mediawiki-config] - https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: TTO)
[22:22:48] <mark>	 Thanks
[22:23:11] <robh>	 shall I text katie?
[22:23:23] <robh>	 (she's head, she'd know if someone else in her department can handle what jeff would?)
[22:24:03] <mark>	 Doesn't hurt, not sure if codfw is active
[22:24:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:24:33] <robh>	 I'll include that its codfw/dallas in the text and ask =]
[22:24:39] <mark>	 Ok
[22:24:54] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:25:12] <krrrit-wm>	 (CR) MZMcBride: "I think "\Archive 1" is a pretty obscure use-case." [mediawiki-config] - https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: TTO)
[22:25:56] <akosiaris>	 robh:  Last reboot reason             0x800:reboot due to exception
[22:26:01] <akosiaris>	 ok that is not power
[22:27:27] <robh>	 I've texted Katie with what happened, who I've contacted, and asking if she knows if its live or not, etc...
[22:27:49] <robh>	 Also I thought frack had things double wired to avoid this?
[22:27:56] <robh>	 (wired to both pfw)
[22:28:19] <robh>	 or perhaps each server is simply mirrored by another server on the other pfw...
[22:28:39] <robh>	 akosiaris: but only one of the two had that right?
[22:28:40] <akosiaris>	 the pfw's in frack are a cluster
[22:28:49] <akosiaris>	 and are accessible as one
[22:28:50] <robh>	 or did both pfw then exception and reboot?
[22:29:04] <akosiaris>	 I am inclined to say both 
[22:29:06] <robh>	 Katie replied to SMS that she doesn't think Dallas is critical to them but is checking
[22:29:14] <robh>	 So we are likely ok, but still checking =]
[22:29:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:29:35] <akosiaris>	 I'll try to connect to the pfw's console
[22:29:50] <akosiaris>	 maybe some more info there
[22:29:53] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:30:14] <robh>	 I also really dont want to wake up Jaime for frackdb stuff if its non critical, he'll be waking up soon enough to fly
[22:30:30] <robh>	 i guess its not quite midnight there though for you guys?
[22:30:41] <akosiaris>	 I doubt jaime will be able to help here anyway
[22:30:51] <akosiaris>	 plus it's pretty obviously the pfw
[22:31:12] <robh>	 i just meant in db recovery but yep
[22:31:31] <akosiaris>	 I don't think there is anything to recover is my point
[22:33:47] <akosiaris>	 nothing on the serial either
[22:34:00] <akosiaris>	 so, pfw rebooted, probably not due to power issues, no logs...
[22:34:04] <chasemp>	 none of the monitored external services have flapped at all, I'm suspecting dallas for fr is not critical?
[22:34:09] <akosiaris>	 that sounds comforting
[22:34:18] <akosiaris>	 chasemp: yup, katie indicated so as well
[22:34:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:34:44] <akosiaris>	 damn, it's getting dark way too fast here...
[22:34:54] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:36:03] <akosiaris>	 so, those 2 alerts are for slave behind master (null).. not very helpful
[22:36:10] <akosiaris>	 otherwise we seem to be ok
[22:38:12] <krrrit-wm>	 (CR) BryanDavis: On Beta Cluster: Use different logo for login form (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: Jdlrobson)
[22:38:59] <akosiaris>	 robh: yes both nodes rebooted due to the same reason
[22:39:08] <robh>	 urgh
[22:39:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:39:27] <robh>	 well, the mysql errors dont page either so not the end of the world
[22:39:53] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:40:01] <robh>	 as long as K4-713 doesn't reverse statement about codfw being in use =]
[22:40:05] <paravoid>	 what's up?
[22:40:21] <K4-713>	 robh: So, funny story. 
[22:40:48] <robh>	 paravoid: both pfw1 and pfw2 codfw rebooted with an odd exception.  0x800:reboot due to exception
[22:41:06] <K4-713>	 robh: Our relatively new 2-factor auth system seems to be damaged beyond me being able to get in to anything.
[22:41:50] <K4-713>	 I think we're still just fine. I just can't get in to the frack to actually check anything. 
[22:42:02] <robh>	 K4-713: i can login to tellurium
[22:42:05] <robh>	 with the 2factor
[22:42:17] <robh>	 (my yubikey seems to work)
[22:42:19] * K4-713 tries again
[22:42:46] <robh>	 I can login but I have ZERO frack infrastructure knowledge.  my access has been to generate new ssl certificates
[22:43:05] <robh>	 basically so i can generate the keys and never copy them off frack
[22:43:18] <K4-713>	 Ah, I see. 
[22:43:32] <K4-713>	 Generally I'd be checking logs on indium right about now. 
[22:44:19] <robh>	 well, iridium isnt frack login though right?
[22:44:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:44:28] <robh>	 that seems to just use my normal production key, not frack.
[22:44:38] * robh just logged into it with that
[22:44:53] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:45:09] <chasemp>	 iridium is frack robh...unless there are two of them
[22:45:12] <chasemp>	 isn't
[22:45:15] <chasemp>	 damn, sorry
[22:45:24] <K4-713>	 Indium. 
[22:45:28] <K4-713>	 Not Iridium. 
[22:45:30] <robh>	 oh, sorry
[22:45:34] <K4-713>	 ...totally not confusing. :)
[22:45:36] <robh>	 hehe
[22:45:57] <robh>	 hrmm, that one i cannot login to
[22:46:08] <robh>	 doesnt like my identification
[22:46:26] <K4-713>	 I'm getting "ssh_exchange_identification: Connection closed by remote host" when I try to use the yubikey
[22:47:00] <robh>	 yea, its borked and doesnt even let me authenticate to the point of using yubi
[22:47:11] <robh>	 i can connect to its mgmt of course, but i dont know the frack root password =[
[22:47:20] <robh>	 so i cannot connect to its serial console
[22:49:16] <paravoid>	 I'm logged in to frack, what do you need?
[22:49:21] <paravoid>	 Last login: Tue Oct  6 16:03:14 2015 from tellurium.frack.eqiad.wmnet
[22:49:23] <icinga-wm>	 PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:49:25] <paravoid>	 faidon@boron:~$ 
[22:49:49] <K4-713>	 Well, okay. I am able to verify that we're getting new donation messages in our queue. 
[22:49:53] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null)
[22:49:54] <K4-713>	 I can't... actually get in to anything else. 
[22:50:19] <K4-713>	 robh: But, I think we can take it from here.
[22:50:48] <robh>	 K4-713: cool, cuz indium's frack root isnt same as production root and i dunno it
[22:50:59] <robh>	 K4-713: well, as long as we arent leaving you all in a horrible state, im cool with that =]
[22:51:06] <K4-713>	 I've seen worse. :p
[22:51:14] <robh>	 I'm still not sure why they rebooted (the pfw)
[22:51:32] <robh>	 but i imagine that's going to need jeff and mark (or faidon, or alex) to dig into the pfw itself and see whats up
[22:51:43] <paravoid>	 alex and I are already on it
[22:51:56] <robh>	 (i figured but i didnt want to speak for you ;)
[22:52:16] <K4-713>	 I can't wait to hear what happened there. And, it's very interesting that the 2fa is all borked for some reason. 
[22:52:30] <paravoid>	 well i'm not sure if we'll ever find out what happened
[22:52:38] <paravoid>	 both nodes seem to have crashed 1 minute apart
[22:52:44] <paravoid>	 there is a coredump file for one of the core daemons
[22:52:55] <paravoid>	 ...that is of 0 bytes filesize
[22:52:58] <paravoid>	 both are
[22:54:23] <icinga-wm>	 RECOVERY - check_mysql on payments2001 is OK: Uptime: 1498981 Threads: 3 Questions: 982032 Slow queries: 0 Opens: 37 Flush tables: 1 Open tables: 30 Queries per second avg: 0.655 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:54:53] <icinga-wm>	 RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1497847 Threads: 1 Questions: 13683914 Slow queries: 6918 Opens: 12739 Flush tables: 2 Open tables: 64 Queries per second avg: 9.135 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:56:52] <robh>	 K4-713: just fyi i had texted Jeff Green with the outage notification, and I also texted him a followup a few minutes ago that you had responded and we seemed to be in a non emergency but non optimal state
[22:56:58] <robh>	 that was before those cleared just started scrolling.
[22:57:12] <robh>	 (I didnt want him to get all the 'oh shit' messages and no resolution ;)
[22:57:22] <K4-713>	 Much appreciated. 
[22:57:52] <K4-713>	 robh: Aaand, now the frack 2fa is working normally.
[22:58:35] <robh>	 heh, at least it happened when its daytime for all of ops, thats always nice
[23:01:09] <K4-713>	 Seriously. I'm just happy it wasn't December. 
[23:38:14] <icinga-wm>	 PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:59:05] <icinga-wm>	 RECOVERY - carbon-cache write error on graphite1001 is OK: OK: Less than 1.00% above the threshold [1.0]