[00:01:04] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [00:02:33] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716899 (10Dzahn) now in icinga config we can see how our new contact... [00:20:29] (03CR) 10: [C: 031] "Per https://phabricator.wikimedia.org/T110619#1714416" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [00:23:19] (03CR) 10: "There are redirects right now for those urls, but this should be fixed for real here. (Also, using this comment as a test of what is going" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [00:23:27] heh [00:24:57] (03CR) 10Greg Grossmeier: "(And another, sorry for the noise)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [00:24:59] yay [00:25:04] legoktm: fixed it [00:26:14] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:26:38] greg-g: you were batman [00:27:22] (03PS1) 10EBernhardson: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 [00:29:40] yuvipanda: legoktm I had to click 'reload' here: http://i.imgur.com/hSxiM6j.png (I think it broke after I added another identity, to test that recent bug report about it) [00:30:39] ah [00:30:41] nice [00:30:46] (03PS2) 10EBernhardson: Log messages sent to the 'warning' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244832 [00:35:31] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [00:43:09] 10Ops-Access-Requests, 6operations, 
3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716953 (10Dzahn) And finally I added `can_submit_commands 1` to th... [00:46:57] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1716958 (10AKoval_WMF) Thanks @JohnLewis. We definitely don't want to delete archives! :) Glad I've got the terminology straight now. To clarify which list order we want... [00:47:06] (03PS2) 10Yuvipanda: puppet: Have a 'secret' repository for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) [00:47:43] (03PS3) 10Yuvipanda: puppet: Have a 'secret' repository for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) [00:49:12] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.001 second response time [00:49:32] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.009 second response time [00:50:52] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4450 bytes in 0.002 second response time [00:51:12] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4450 bytes in 0.002 second response time [00:51:39] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716959 (10Smalyshev) Checked and now I can control notifications for... [00:51:45] that was a test for notifications for SMalyshev ^ [00:52:10] yep seems to work well, thanks! 
[00:54:15] (03CR) 10Yuvipanda: [C: 032] "Tested" [puppet] - 10https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda) [00:54:58] SMalyshev: the states there for services besides critical and recovery are: w(arning), f(lapping) and u(nknown) [00:56:56] and for hosts there is u(nreachable) and s to get notified when scheduled downtimes start and end [01:00:47] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716963 (10Dzahn) 5Open>3Resolved [01:00:57] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1599052 (10Dzahn) [01:02:00] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1599052 (10Dzahn) added "w" to service_notification_options and "u" to host_notification_o... [01:09:09] (03CR) 10Alex Monk: [C: 04-1] "It looks like you missed loads..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [01:18:07] (03PS1) 10Yuvipanda: k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 [01:18:27] (03PS2) 10Yuvipanda: k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 [01:20:15] (03CR) 10Yuvipanda: [C: 032] k8s: Move abac to puppet + hiera [puppet] - 10https://gerrit.wikimedia.org/r/244837 (owner: 10Yuvipanda) [01:30:17] (03PS1) 10Yuvipanda: k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 [01:30:20] (03CR) 10jenkins-bot: [V: 04-1] k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 (owner: 10Yuvipanda) [01:30:47] (03PS2) 10Yuvipanda: k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 [01:31:32] (03CR) 10Yuvipanda: [C: 032] k8s: Fix erb syntax [puppet] - 10https://gerrit.wikimedia.org/r/244840 (owner: 10Yuvipanda) [01:33:10] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:37:52] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1717004 (10Selsharbaty-WMF) @JohnLewis, Thanks for clarifying. Things are much clearer now. Can we create the new list with the same name and mailing address of the one... [01:50:29] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717006 (10Dzahn) [01:52:39] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717008 (10Dzahn) renamed ticket to clarify we had agreed on "only limited to services the user has access to" vs. full access and it was stalled... 
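For reference, the contact settings discussed above (`can_submit_commands 1`, "w" in service_notification_options, "u" in host_notification_options) might look roughly like this in a Nagios-style Icinga 1 contact definition. This is only a sketch: the contact name, alias, email address, time period and notification command names are hypothetical placeholders, and the option letters other than "w" and "u" are assumed; none of it is taken from the actual Wikimedia configuration.

```
# Hypothetical contact; every value below is a placeholder unless noted.
define contact {
    contact_name                   example_user               ; placeholder name
    alias                          Example User               ; placeholder
    email                          example_user@example.org   ; placeholder
    host_notification_period       24x7                       ; assumed time period name
    service_notification_period    24x7                       ; assumed time period name
    host_notification_options      d,u,r      ; d(own), u(nreachable), r(ecovery)
    service_notification_options   w,c,r      ; w(arning), c(ritical), r(ecovery)
    host_notification_commands     notify-host-by-email       ; assumed command name
    service_notification_commands  notify-service-by-email    ; assumed command name
    can_submit_commands            1          ; contact may submit external commands from the web UI
}
```

Adding f (flapping) or s (scheduled downtime start and end) to either options list would enable the other notification types mentioned in the discussion.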
[01:52:47] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717010 (10Dzahn) 5stalled>3Open [01:56:11] (03PS1) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) [01:59:09] (03PS2) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) [02:00:19] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:03:36] (03PS1) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) [02:04:53] (03PS2) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) [02:05:18] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [02:05:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:06:02] (03CR) 10Dzahn: [C: 032] "also enables email notification for existing service admin" [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn) [02:06:06] (03PS3) 10Dzahn: icinga: add contact group for mailman admins [puppet] - 10https://gerrit.wikimedia.org/r/244842 (https://phabricator.wikimedia.org/T105229) [02:08:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:10:11] (03PS1) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 [02:10:14] (03CR) 10jenkins-bot: [V: 
04-1] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda) [02:10:43] (03PS2) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 [02:10:57] (03PS1) 10Dzahn: lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) [02:11:46] (03PS2) 10Dzahn: lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) [02:13:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:14:45] (03PS3) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 [02:14:49] (03CR) 10jenkins-bot: [V: 04-1] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda) [02:15:10] (03CR) 10Dzahn: [C: 032] lists: add mailman-admins to contact groups [puppet] - 10https://gerrit.wikimedia.org/r/244844 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn) [02:15:20] (03PS4) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 [02:16:58] (03PS1) 10Dzahn: lists: move admin group from node to role [puppet] - 10https://gerrit.wikimedia.org/r/244846 [02:19:33] (03PS2) 10Dzahn: lists: move admin group from node to role [puppet] - 10https://gerrit.wikimedia.org/r/244846 [02:19:42] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 05m 58s) [02:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:20:33] (03CR) 10Dzahn: [C: 032] "no change - http://puppet-compiler.wmflabs.org/988/fermium.wikimedia.org/" [puppet] - 
10https://gerrit.wikimedia.org/r/244846 (owner: 10Dzahn) [02:20:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds [02:22:18] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors daniel_zahn adding missing contact [02:22:30] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-10 02:22:30+00:00 [02:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [02:30:47] 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717047 (10Dzahn) {P2182} [02:31:25] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [02:33:18] 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717048 (10Dzahn) following the same pattern as in T111243, this is now resolved. email notifications and access to send c... 
[02:33:43] 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717049 (10Dzahn) 5Open>3Resolved [02:33:56] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1439115 (10Dzahn) [02:36:46] (03PS1) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 [02:37:07] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [02:43:49] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1717052 (10Dzahn) p:5Normal>3Low [02:44:21] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1717054 (10Dzahn) 5Open>3stalled [02:45:18] hmmm [02:45:22] it's timing out connecting to gerrit [02:45:47] the secret is fine [02:58:32] yuvipanda: gerrit server also has ferm now [02:58:36] maybe that [02:58:44] mutante: nope [02:58:49] it's a direct effect of me playing with k8s [02:58:53] ok [02:58:56] and missing a 'notify' in puppet [02:58:57] fixing it now [02:59:04] this is one of the reasons I'm starting to hate my sleep patterns more [02:59:12] heh. ok. then i'll say good night and cu on an island [02:59:14] if I were to have woken up like this in India, I'd have all of europe again [02:59:17] now I've nobody [02:59:20] mutante: are you also flying tomorrow? [02:59:24] am I the only person flying on sunday... [02:59:28] no Sunday morning [02:59:34] but early [03:00:01] and San Jose airport out [03:00:38] mutante: ah... [03:00:40] ok [03:00:42] I'm also early sunday morning [03:00:46] which means no sleeping on saturday. 
[03:00:48] booooooo [03:00:54] wait how am I even going to get to the airport [03:01:13] of course uber because you'll be too late for bart? [03:01:17] too early [03:01:19] yeah [03:01:24] or airport shuttle [03:01:26] and order it now [03:01:33] to pick you up [03:01:44] I'm looking at my ticket to just make sure [03:01:45] those blue and yellow small busses [03:01:50] the year/month/date is throwing me off [03:01:53] since it says 2015/10/11 [03:01:57] but tomorrow is the 10th [03:02:00] err [03:02:02] 10/11/2015 [03:02:05] or you do it like Leslie would have and bike to airport [03:02:44] im also wondering, since i fly out a different airport than i come back [03:03:03] it would not even make sense to park a car if it was free and not hundreds of dollars :p [03:03:53] anyways.. and apropos sleep patterns .. time for a break from laptop :) [03:03:56] cu [03:04:02] (03PS4) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 [03:04:02] bye! [03:06:22] greg-g: yay [03:07:46] (03PS1) 10Yuvipanda: dynamicproxy: Do not depend on labsdebrepo! [puppet] - 10https://gerrit.wikimedia.org/r/244852 [03:08:51] (03PS1) 10Yuvipanda: labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 [03:09:14] (03PS2) 10Yuvipanda: labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 [03:09:16] (03PS2) 10Yuvipanda: dynamicproxy: Do not depend on labsdebrepo! [puppet] - 10https://gerrit.wikimedia.org/r/244852 [03:12:26] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Do not depend on labsdebrepo! 
[puppet] - 10https://gerrit.wikimedia.org/r/244852 (owner: 10Yuvipanda) [03:12:37] (03CR) 10Yuvipanda: [C: 032] labsdebrepo: Fix names to match file paths [puppet] - 10https://gerrit.wikimedia.org/r/244853 (owner: 10Yuvipanda) [03:21:15] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [03:22:55] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [04:16:51] (03PS4) 10Yuvipanda: hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) [04:17:25] (03CR) 10Yuvipanda: [C: 032 V: 032] hiera: Add support for 'secret' datadir [puppet] - 10https://gerrit.wikimedia.org/r/244841 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda) [04:24:19] (03PS6) 10Yuvipanda: k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 [04:24:36] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Use secret hieradata for managing users and tokens [puppet] - 10https://gerrit.wikimedia.org/r/244843 (owner: 10Yuvipanda) [04:24:52] (03PS5) 10Yuvipanda: k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 [04:25:03] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Pick up client password from secret hieradata [puppet] - 10https://gerrit.wikimedia.org/r/244851 (owner: 10Yuvipanda) [05:06:15] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [05:30:16] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [05:35:35] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [05:47:23] (03PS1) 10Yuvipanda: k8s: Fix kube2proxy to be nicer [puppet] - 10https://gerrit.wikimedia.org/r/244857 [05:48:20] (03CR) 10Yuvipanda: [C: 032] k8s: Fix kube2proxy to be nicer [puppet] - 10https://gerrit.wikimedia.org/r/244857 (owner: 10Yuvipanda) [05:54:06] (03PS1) 
10Yuvipanda: k8s: Setup ssl certs to be owned by kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/244858 [05:54:58] (03CR) 10Yuvipanda: [C: 032] k8s: Setup ssl certs to be owned by kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/244858 (owner: 10Yuvipanda) [05:57:11] (03PS1) 10Yuvipanda: k8s: Remove extra k8s::ssl include [puppet] - 10https://gerrit.wikimedia.org/r/244859 [05:58:01] (03CR) 10Yuvipanda: [C: 032] k8s: Remove extra k8s::ssl include [puppet] - 10https://gerrit.wikimedia.org/r/244859 (owner: 10Yuvipanda) [06:02:05] (03PS1) 10Yuvipanda: k8s: Fix path to ssl cert path for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244860 [06:03:51] (03CR) 10Yuvipanda: [C: 032] k8s: Fix path to ssl cert path for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244860 (owner: 10Yuvipanda) [06:10:04] (03PS1) 10Yuvipanda: k8s: Add support for infrastructure-readonly accounts [puppet] - 10https://gerrit.wikimedia.org/r/244861 [06:10:47] (03PS1) 10Yuvipanda: tools: Use the proxy-infrastructure account for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244862 [06:11:12] (03CR) 10Yuvipanda: [C: 032] k8s: Add support for infrastructure-readonly accounts [puppet] - 10https://gerrit.wikimedia.org/r/244861 (owner: 10Yuvipanda) [06:11:24] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Use the proxy-infrastructure account for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/244862 (owner: 10Yuvipanda) [06:17:38] 6operations, 6Labs, 10Labs-Team-Backlog: Make sure that the 'secret' repo in self hosted puppetmasters is back-upable - https://phabricator.wikimedia.org/T115177#1717146 (10yuvipanda) [06:23:58] (03PS1) 10Yuvipanda: aptly: Add ability to mark a repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244863 (https://phabricator.wikimedia.org/T112699) [06:24:53] (03CR) 10Yuvipanda: [C: 032] aptly: Add ability to mark a repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244863 
(https://phabricator.wikimedia.org/T112699) (owner: 10Yuvipanda) [06:26:11] (03PS1) 10Yuvipanda: aptly: Mark aptly repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244864 (https://phabricator.wikimedia.org/T112699) [06:29:46] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:25] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:35] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:46] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:26] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:55] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:33:34] (03CR) 10Yuvipanda: [C: 032] aptly: Mark aptly repo as trusted [puppet] - 10https://gerrit.wikimedia.org/r/244864 (https://phabricator.wikimedia.org/T112699) (owner: 10Yuvipanda) [06:40:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 10 06:40:04 UTC 2015 (duration 40m 3s) [06:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:49:40] (03CR) 10MZMcBride: "Related: and ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175007 (owner: 10Ori.livneh) [06:54:03] (03CR) 10Hashar: [C: 031] "It is magic! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/244814 (owner: 10John F. 
Lewis) [06:56:36] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:05] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:57:47] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:47] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:06] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:25] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:15:00] (03CR) 10Glaisher: "Which entries are you referring to? I just checked again and there's no such entry. 
All the commonsuploads wikis which are now specified h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [09:03:06] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [09:04:45] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [09:35:26] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [09:56:46] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [10:00:17] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [10:35:53] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1717266 (10JohnLewis) 5Resolved>3Open Tried to test this by un silencing the mailman queue check (permissions noted above are correct) yet I... [11:32:07] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [11:40:44] (03CR) 10TTO: "@Krenair, I'm not sure that's a good idea. Remember that labswiki has no restrictions on account creation; doing that would open the door " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:43:23] (03CR) 10Alex Monk: [C: 04-1] "Yeah, I don't think you understand the code you're changing here. It's already not including the meta blacklist." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:46:14] (03CR) 10TTO: "How so? $wgTitleBlacklistUsernameSources only affects username creation, not page creation." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:46:54] (03CR) 10TTO: "> Yeah, I don't think you understand the code you're changing here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:50:21] (03CR) 10Steinsplitter: "as far as I know, the global title blacklist can be locally whitelisted, so this patch is not needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [11:52:02] (03CR) 10Alex Monk: [C: 031] "My mistake, I think this should be fine actually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243920 (https://phabricator.wikimedia.org/T111335) (owner: 10Glaisher) [11:55:23] (03CR) 10Alex Monk: [C: 031] Remove duplicate entries from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244435 (owner: 10Glaisher) [11:56:12] (03CR) 10Alex Monk: [C: 031] Fix nbwiki to nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244736 (owner: 10Amire80) [11:57:54] (03CR) 10Alex Monk: [C: 031] Use new page name for wmf release notes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [11:59:44] (03CR) 10Alex Monk: "Needs rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 (owner: 10TTO) [12:01:12] (03CR) 10Alex Monk: [C: 04-1] "currently open dependency, getting this out of the review queue until it's ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher) [12:02:02] (03PS2) 10TTO: Revert "Route Bug40009 logs to fluorine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244141 [12:05:56] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [12:20:05] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout 
after 2 seconds [12:21:45] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [12:43:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0] [12:47:37] (03CR) 10Alex Monk: [C: 031] "Files in tin:/srv/mediawiki-staging/wmf-config/.svn/tmp/ show this was once used to determine whether to require( $IP.'/extensions/Cite/Sp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243921 (owner: 10Glaisher) [12:55:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:33:35] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:05:25] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [14:18:09] (03PS1) 10Yurik: tilerator should not expose admin UI [puppet] - 10https://gerrit.wikimedia.org/r/244884 [14:28:28] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [14:32:16] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:39:15] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures [14:53:46] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:01:44] (03CR) 10MZMcBride: "Why not just uninstall the TitleBlacklist and SpamBlacklist extensions from private and fishbowl wikis?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [15:07:55] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:15] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out [15:15:49] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 3.005 second response time on port 9042 [15:19:46] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail [15:24:26] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:27] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [15:41:52] (03CR) 10BBlack: IdleConnection: set keepalive (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [15:43:26] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [15:46:46] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:46:46] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [15:51:16] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:57:50] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1717447 (10mmodell) @chasemp: Is there anything remaining for this to be completed? Feel free to claim and close this task. 
:) [16:01:46] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:10:26] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail [16:35:36] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [16:36:02] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1717459 (10MZMcBride) The fundamental approach here seems to be flawed. From reading this task, it seems like we broadly have two options: *... [16:39:15] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:47:56] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1717540 (10ssastry) 5Open>3Resolved a:3ssastry The main task of investigating this is done -- we have found... 
[18:01:46] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:43:37] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [18:45:16] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [18:56:35] ori: ^ Since you asked to be poked about nutcracker on silver things :D [19:05:36] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [19:07:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [19:12:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [19:45:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [19:52:56] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: puppet fail [20:02:47] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:21:26] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:25:25] (03Abandoned) 10Tim Landscheidt: WIP: Tools: Deploy local package management key [puppet] - 10https://gerrit.wikimedia.org/r/240021 (https://phabricator.wikimedia.org/T112699) (owner: 10Tim Landscheidt) [20:35:56] PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: puppet fail [20:36:26] (03PS1) 10Gerrit Patch Uploader: Bug: T114930 Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) [20:36:29] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [20:36:34] (03CR) 10jenkins-bot: [V: 04-1] Bug: T114930 Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:00:15] (03CR) 10Luke081515: "(Need to upload another patch, but have some problems with the upload, so could take a while)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:04:35] RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:13:16] (03PS2) 10Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) [21:13:18] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:13:20] (03CR) 10jenkins-bot: [V: 04-1] Bug: T114930 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:15:35] (03PS3) 10Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) [21:15:37] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:15:39] (03CR) 10jenkins-bot: [V: 04-1] Bug: T114930 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:24:19] 6operations, 6Labs: 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#1717673 (10Krenair) 3NEW [21:26:02] 6operations, 6Labs: RDNS for 10.68.18.65 resolves to two different instances - https://phabricator.wikimedia.org/T115194#1717680 (10yuvipanda) [21:30:51] (03PS4) 10Gerrit Patch Uploader: Bug: T114930 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) [21:30:53] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (https://phabricator.wikimedia.org/T114930) (owner: 10Gerrit Patch Uploader) [21:31:59] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:33:12] (03PS5) 10Luke081515: Add three groups to itwikiversity, and allow sysops to add or remove users to them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244896 (owner: 10Gerrit Patch Uploader) [21:45:08] (03CR) 10Ori.livneh: [C: 04-1] "Every change to metric processing, even if it leads to greater accuracy, makes it harder to compare current data with historic data and th" [puppet] - 10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [21:47:37] (03PS3) 10Ori.livneh: webperf: Allow zero values for navtiming metrics [puppet] - 10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [21:51:19] (03PS4) 10Ori.livneh: webperf: Allow zero values for navtiming metrics [puppet] - 
10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [21:52:05] (03CR) 10Ori.livneh: [C: 032] webperf: Allow zero values for navtiming metrics [puppet] - 10https://gerrit.wikimedia.org/r/244488 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [21:59:47] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [22:01:14] ori: ^ lol [22:01:36] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [22:02:56] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100% [22:03:45] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:03:51] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:03:59] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:04:05] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [22:04:05] <_joe_> oh shit [22:04:10] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100% [22:04:27] indeed, oh shit, seems like frack [22:04:39] they are all frack hosts [22:04:51] i'll text jeff green (he isnt going to our ops meeting) [22:04:51] <_joe_> robh: I have no access to frack [22:04:57] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [8.0] [22:04:58] <_joe_> he is not? [22:05:01] this seems like a pfw is down [22:05:06] <_joe_> yep [22:05:08] he isnt on the sheet for travel iirc, checking [22:05:21] <_joe_> anyways, page him :) [22:05:25] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:05:31] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [22:06:20] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100% [22:06:27] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [22:06:37] i have done so, i can login to the pfw stack [22:06:50] and now it just went away..... 
[22:06:59] (it was responsive and now just died out on me) [22:07:08] _joe_: is paravoid near you ;] [22:07:14] (or are you traveling tomorrow?) [22:07:25] <_joe_> robh: I'm travelling tomorrow [22:07:36] ok, im looking up his # to sms him now [22:08:02] I am around as well [22:08:08] pfw problems ? [22:08:48] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [22:08:51] it seems so, we just lost hosts in frack... [22:08:53] but now one came back [22:09:18] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:09:28] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms [22:09:34] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [22:09:41] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 34.84 ms [22:09:49] faidon is 20 minutes from hotel [22:10:00] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 34.91 ms [22:10:00] 10:09PM up 8 mins, 0 users, load averages: 0.05, 0.43, 0.32 [22:10:10] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 34.90 ms [22:10:11] that's pfw1-codfw [22:10:14] that would explain it [22:10:30] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 35.02 ms [22:10:36] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms [22:10:44] so why the heck did it reboot =[ [22:10:59] looking now [22:11:24] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [22:12:22] Oct 10 22:12:41 pfw-codfw eventd[1089]: SYSTEM_ABNORMAL_SHUTDOWN: System abnormally shut down [22:12:31] power ?
[22:12:49] lemme login to the power strip and check log [22:13:53] (03CR) 10Luke081515: [C: 031] Modify timezone for cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244649 (https://phabricator.wikimedia.org/T115048) (owner: 10Revi) [22:14:20] mehhhhhh what logssss [22:14:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:14:31] so no email notices that the pdu went down but it doesnt have local logging. [22:14:48] (still checking to ensure that's right but nothing in web gui, checking ssh) [22:16:42] ok, there is a log in the ssh [22:16:56] but no recent power loss events logged. [22:17:37] I don't have the new fangled frack access either [22:17:53] also we dont use per outlet switched except in the network racks [22:18:01] so i cannot check for per outlet logging, only infeed overall [22:18:16] i also texted jeff so he should be aware of the issue [22:19:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:19:23] we may want to call him [22:19:35] unless he responded to the text [22:19:54] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:20:25] Do call [22:20:30] pfw did log some high cpu usage alerts this day and the previous one but last one was 10 hours before the incident [22:20:34] I doubt it's related [22:20:37] I'll call now since he hasn't responded [22:22:08] he did not pick up, i left a voicemail [22:22:14] (03CR) 10TTO: "@MZMcBride: The SpamBlacklist extension is already uninstalled from privates and fishbowls. Notice that the request in the task was only t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [22:22:48] Thanks [22:23:11] shall I text katie? [22:23:23] (she's head, she'd know if someone else in her department can handle what jeff would?)
[22:24:03] Doesn't hurt, not sure if codfw is active [22:24:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:24:33] I'll include that its codfw/dallas in the text and ask =] [22:24:39] Ok [22:24:54] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:25:12] (03CR) 10MZMcBride: "I think "\Archive 1" is a pretty obscure use-case." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO) [22:25:56] robh: Last reboot reason 0x800:reboot due to exception [22:26:01] ok that is not power [22:27:27] I've texted Katie with what happened, who I've contacted, and asking if she knows if its live or not, etc... [22:27:49] Also I thought frack had things double wired to avoid this? [22:27:56] (wired to both pfw) [22:28:19] or perhaps each server is simply mirrored by another server on the other pfw... [22:28:39] akosiaris: but only one of the two had that right? [22:28:40] the pfw's in frack are a cluster [22:28:49] and are accessible as one [22:28:50] or did both pfw then exception and reboot? [22:29:04] I am inclined to say both [22:29:06] Katie replied to SMS that she doesn't think Dallas is critical to them but is checking [22:29:14] So we are likely ok, but still checking =] [22:29:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:29:35] I'll try to connect to the pfw's console [22:29:50] maybe some more info there [22:29:53] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:30:14] I also really dont want to wake up Jaime for frackdb stuff if its non critical, he'll be waking up soon enough to fly [22:30:30] i guess its not quite midnight there though for you guys?
[22:30:41] I doubt jaime will be able to help here anyway [22:30:51] plus it's pretty obviously the pfw [22:31:12] i just meant in db recovery but yep [22:31:31] I don't think there is anything to recover is my point [22:33:47] nothing on the serial either [22:34:00] so, pfw rebooted, probably not due to power issues, no logs... [22:34:04] none of the monitored external services have flapped at all, I'm suspecting dallas for fr is not critical? [22:34:09] that sounds comforting [22:34:18] chasemp: yup, katie indicated so as well [22:34:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:34:44] damn, it's getting dark way too fast here... [22:34:54] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:36:03] so, those 2 alerts are for slave behind master (null).. not very helpful [22:36:10] otherwise we seem to be ok [22:38:12] (03CR) 10BryanDavis: On Beta Cluster: Use different logo for login form (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: 10Jdlrobson) [22:38:59] robh: yes both nodes rebooted due to the same reason [22:39:08] urgh [22:39:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:39:27] well, the mysql errors dont page either so not the end of the world [22:39:53] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:40:01] as long as K4-713 doesn't reverse statement about codfw being in use =] [22:40:05] what's up? [22:40:21] robh: So, funny story. [22:40:48] paravoid: both pfw1 and pfw2 codfw rebooted with an odd exception. 0x800:reboot due to exception [22:41:06] robh: Our relatively new 2-factor auth system seems to be damaged beyond me being able to get in to anything.
[22:41:50] I think we're still just fine. I just can't get in to the frack to actually check anything. [22:42:02] K4-713: i can login to tellurium [22:42:05] with the 2factor [22:42:17] (my yubikey seems to work) [22:42:19] * K4-713 tries again [22:42:46] I can login but I have ZERO frack infrastructure knowledge. my access has been to generate new ssl certificates [22:43:05] basically so i can generate the keys and never copy them off frack [22:43:18] Ah, I see. [22:43:32] Generally I'd be checking logs on indium right about now. [22:44:19] well, iridium isnt frack login though right? [22:44:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:44:28] that seems to just use my normal production key, not frack. [22:44:38] * robh just logged into it with that [22:44:53] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:45:09] iridium is frack robh...unless there are two of them [22:45:12] isn't [22:45:15] damn, sorry [22:45:24] Indium. [22:45:28] Not Iridium. [22:45:30] oh, sorry [22:45:34] ...totally not confusing. :) [22:45:36] hehe [22:45:57] hrmm, that one i cannot login to [22:46:08] doesnt like my identification [22:46:26] I'm getting "ssh_exchange_identification: Connection closed by remote host" when I try to use the yubikey [22:47:00] yea, its borked and doesnt even let me authenticate to the point of using yubi [22:47:11] i can connect to its mgmt of course, but i dont know the frack root password =[ [22:47:20] so i cannot connect to its serial console [22:49:16] I'm logged in to frack, what do you need? [22:49:21] Last login: Tue Oct 6 16:03:14 2015 from tellurium.frack.eqiad.wmnet [22:49:23] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:49:25] faidon@boron:~$ [22:49:49] Well, okay.
I am able to verify that we're getting new donation messages in our queue. [22:49:53] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [22:49:54] I can't... actually get in to anything else. [22:50:19] robh: But, I think we can take it from here. [22:50:48] K4-713: cool, cuz indium's frack root isnt same as production root and i dunno it [22:50:59] K4-713: well, as long as we arent leaving you all in a horrible state, im cool with that =] [22:51:06] I've seen worse. :p [22:51:14] I'm still not sure why they rebooted (the pfw) [22:51:32] but i imagine that's going to need jeff and mark (or faidon, or alex) to dig into the pfw itself and see whats up [22:51:43] alex and I are already on it [22:51:56] (i figured but i didnt want to speak for you ;) [22:52:16] I can't wait to hear what happened there. And, it's very interesting that the 2fa is all borked for some reason. [22:52:30] well i'm not sure if we'll ever find out what happened [22:52:38] both nodes seem to have crashed 1 minute apart [22:52:44] there is a coredump file for one of the core daemons [22:52:55] ...that is of 0 bytes filesize [22:52:58] both are [22:54:23] RECOVERY - check_mysql on payments2001 is OK: Uptime: 1498981 Threads: 3 Questions: 982032 Slow queries: 0 Opens: 37 Flush tables: 1 Open tables: 30 Queries per second avg: 0.655 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [22:54:53] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1497847 Threads: 1 Questions: 13683914 Slow queries: 6918 Opens: 12739 Flush tables: 2 Open tables: 64 Queries per second avg: 9.135 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [22:56:52] K4-713: just fyi i had texted Jeff Green with the outage notification, and I also texted him a followup a few minutes ago that you had responded and we seemed to be in a non emergency but non optimal state [22:56:58] that was before those cleared just started scrolling.
[22:57:12] (I didnt want him to get all the 'oh shit' messages and no resolution ;) [22:57:22] Much appreciated. [22:57:52] robh: Aaand, now the frack 2fa is working normally. [22:58:35] heh, at least it happened when its daytime for all of ops, thats always nice [23:01:09] Seriously. I'm just happy it wasn't December. [23:38:14] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [23:59:05] RECOVERY - carbon-cache write error on graphite1001 is OK: OK: Less than 1.00% above the threshold [1.0]