[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150604T0000). Please do the needful.
[00:00:29] PROBLEM - puppet last run on mw2203 is CRITICAL puppet fail
[00:01:01] (CR) Ori.livneh: [C: 1] Make it possible to install multiple custom diamond collectors that use the same source [puppet] - https://gerrit.wikimedia.org/r/215056 (owner: Ottomata)
[00:06:38] PROBLEM - puppet last run on neptunium is CRITICAL puppet fail
[00:08:52] (PS1) Faidon Liambotis: ldap: fix broken dependencies [puppet] - https://gerrit.wikimedia.org/r/215831
[00:09:26] (CR) Faidon Liambotis: [C: 2] ldap: fix broken dependencies [puppet] - https://gerrit.wikimedia.org/r/215831 (owner: Faidon Liambotis)
[00:09:32] (CR) Faidon Liambotis: [V: 2] ldap: fix broken dependencies [puppet] - https://gerrit.wikimedia.org/r/215831 (owner: Faidon Liambotis)
[00:10:38] PROBLEM - puppet last run on nembus is CRITICAL puppet fail
[00:11:39] RECOVERY - puppet last run on neptunium is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[00:13:58] RECOVERY - puppet last run on nembus is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:14:38] PROBLEM - puppet last run on analytics1002 is CRITICAL Puppet has 1 failures
[00:19:08] RECOVERY - puppet last run on mw2203 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[00:24:58] RECOVERY - puppet last run on analytics1002 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[00:31:18] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:07] (PS1) 20after4: Bump phabricator tag [puppet] - https://gerrit.wikimedia.org/r/215832
[00:39:27] can I get a +2 on https://gerrit.wikimedia.org/r/#/c/215832/ ?
[00:39:42] (CR) Ori.livneh: [C: 2 V: 2] Bump phabricator tag [puppet] - https://gerrit.wikimedia.org/r/215832 (owner: 20after4)
[00:39:48] Thanks!
[00:39:58] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:41:38] RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:44:02] twentyafterfour: How long should Phabricator be down for?
[00:44:34] James_F: probably 5 minutes if all goes well, maybe a bit longer
[00:44:38] Kk.
[00:45:04] Essentially long enough to dump a database backup and then pull the tags and apply db migrations
[00:46:00] twentyafterfour: Thanks.
[00:46:29] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2737 bytes in 0.034 second response time
[00:47:15] Phab is down :(
[00:47:39] kaldari: Yes, weekly update window. https://wikitech.wikimedia.org/wiki/Deployments
[00:48:00] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2737 bytes in 0.034 second response time daniel_zahn being upgraded
[00:48:11] (PS1) BBlack: trying to fix chain-regeneration dep issues... [puppet] - https://gerrit.wikimedia.org/r/215833
[00:48:41] hmm, seems like a weird time to update, but whatever :P
[00:49:46] (PS2) BBlack: trying to fix chain-regeneration dep issues... [puppet] - https://gerrit.wikimedia.org/r/215833
[00:50:03] (CR) BBlack: [C: 2 V: 2] trying to fix chain-regeneration dep issues... [puppet] - https://gerrit.wikimedia.org/r/215833 (owner: BBlack)
[00:50:13] phab is 503ing for me...
[00:50:19] maintenance
[00:50:23] bawolff: upgrade
[00:50:39] ok
[00:50:39] 65% done with the db dump, then I just have to fire up the services again
[00:53:28] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 25793 bytes in 0.354 second response time
[00:53:49] there you go, monitoring works
[00:54:20] kaldari: phab is back
[00:54:29] yay
[00:54:37] was that a less than 3 minute outage or is tmobile just being slow to send me all my texts?
[00:54:44] i got them all in at same time =P
[00:54:53] twentyafterfour: exciting new features?
[00:55:30] mutante: nothing huge, but my api patch got merged upstream (adds features we need to the project api) and some improvements to the calendar app
[00:55:56] robh: 17:47 PROBLEM, 17:48 ACK, 17:54 RECOVERY
[00:56:27] twentyafterfour: cool
[00:56:59] mutante: cool, thx for info. so that is pretty short... i still dislike that my phone got them all at the same time
[00:57:00] twentyafterfour: let's schedule a downtime in icinga next time
[00:57:04] but i was on a call for the first minute of that window
[00:57:17] twentyafterfour: you can just tell an ops to run a script on neon
[00:57:26] so yay gsm drawbacks! (seems i cannot get a text while on a phone call on tmobile)
[00:57:28] or we could cronjob it
[00:57:37] if the deployments are always the exact same time
[00:58:00] or manual click in web ui
[00:58:36] uh oh
[00:58:45] Access denied for user .. to database 'phabricator_spaces'
[00:59:00] (PS1) BBlack: temporary workaround for messed up chained gen [puppet] - https://gerrit.wikimedia.org/r/215834
[00:59:16] (CR) BBlack: [C: 2 V: 2] temporary workaround for messed up chained gen [puppet] - https://gerrit.wikimedia.org/r/215834 (owner: BBlack)
[00:59:39] mutante: ok sorry about that
[00:59:44] twentyafterfour: just in a specific app?
[01:00:15] ah, i see
[01:00:41] Unknown database 'phabricator_spaces'.
[01:00:58] so not denied, but doesnt exist?
[01:01:45] the normal db user doesnt have permission to create a new db i suppose
[01:03:05] twentyafterfour: so this spaces db is a new thing introduced in this version?
[01:05:36] yes
[01:05:45] and why doesn't phuser have permission to create tables?
[01:05:57] I can't apply any of the migrations because of too-restrictive permissions
[01:06:21] create databases apparently, not tables
[01:06:28] I need both
[01:06:32] seems like it would be pretty normal for an app not to have permission to create new tables
[01:06:38] err, databases
[01:06:59] well databases come along rarely, but tables more often
[01:07:09] CREATE command denied to user 'phuser'@ ... for table 'differential_hiddencomment'
[01:07:21] is that in the new db though, or an existing one?
[01:07:28] existing one
[01:07:29] yes, so the thing is that phab always wants a new database
[01:07:32] for each app
[01:07:41] and the regular user doesn't have the rights to do so
[01:07:45] I removed the spaces migrations to try to get past that point and it failed on another one
[01:07:51] but the upgrade wants to add one
[01:08:04] I can do without the new db if I could at least create tables
[01:08:11] (since I skipped the migrations for spaces)
[01:08:30] I'm out of window so I guess I should just roll back and do this next week?
[01:08:53] well, it doesnt break anything existing it seems
[01:09:01] only the new thing doesnt work yet
[01:09:02] right
[01:09:13] I don't know
[01:09:17] PROBLEM - puppet last run on mw2093 is CRITICAL puppet fail
[01:09:21] it's probably broken
[01:09:29] code and db need to match
[01:09:47] so far it seems to me like existing phab links work fine
[01:09:58] just when i go to the "spaces" thing i get the error page
[01:10:15] and seems to make sense because phab has a separate db for each app
[01:10:18] hah I didn't even realize puppet must have restarted apache ;)
[01:10:39] well the migrations are minor but a couple of them might matter
[01:10:50] i would think we just make a ticket for springle and jynus
[01:10:50] owners app is one
[01:10:59] calendar is the other
[01:10:59] requesting a new db and/or grants
[01:11:04] ok
[01:12:20] what's jynus's phab username?
[01:12:37] jcrespo
[01:13:29] what's broken?
[01:13:40] (btw, i said that because it's not a local DB but m3-master and springle once asked to have tickets for grants)
[01:13:50] springle: phabricator got upgraded and wants to create a new db
[01:14:03] phuser@m3-master
[01:14:18] why is upgrade using phuser? i thought we had phadmin for that?
[01:14:21] Unknown database 'phabricator_spaces'.
[01:14:42] heh, so you already had a special user for that problem
[01:14:47] aha
[01:15:08] chasem.p asked for this last year sometime. fairly sure that upgrades need to use phadmin
[01:15:08] https://phabricator.wikimedia.org/T101347
[01:15:11] twentyafterfour: phadmin user?
[01:15:25] hmm
[01:15:33] I don't know how to make it do that, let me see
[01:15:36] phuser is the web user. it should not be able to do schema changes :) like, ever
[01:15:59] how would I find the phadmin password?
[01:16:45] I assume that's in a private puppet repo?
not sure where I would look for it
[01:17:45] class passwords::mysql::phabricator
[01:18:02] found it
[01:18:58] PROBLEM - puppet last run on sodium is CRITICAL Puppet has 1 failures
[01:19:25] i found the grants are also in puppet
[01:19:35] templates/mariadb/production-grants-m3.sql.erb
[01:19:37] that's cool
[01:20:29] so you can confirm there phadmin has ALL
[01:21:54] ok all should be ok
[01:22:06] phadmin can only work from iridium, too. just noting
[01:22:09] can phadmin create tables?
[01:22:13] yes
[01:22:18] and databases?
[01:22:19] or just tables
[01:22:21] yes
[01:22:23] both
[01:22:23] cool
[01:22:36] thanks sorry to bother ya, I didn't realize it was set up like that
[01:23:27] np. i'll quiz chase to make sure assumptions haven't changed recently, but fairly sure phadmin needs to stay and be used
[01:24:04] yeah it's all good it was my mistake
[01:25:13] (PS1) BBlack: sslcert::chainedcert deps: a slightly better solution [puppet] - https://gerrit.wikimedia.org/r/215839
[01:25:33] error is gone, so this is the new thing https://phabricator.wikimedia.org/spaces/
[01:25:53] (CR) jenkins-bot: [V: -1] sslcert::chainedcert deps: a slightly better solution [puppet] - https://gerrit.wikimedia.org/r/215839 (owner: BBlack)
[01:25:57] I think it's incomplete but I don't know
[01:26:11] (the spaces app is probably not finished)
[01:26:37] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[01:26:51] gotcha
[01:26:57] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail
[01:29:20] (PS2) BBlack: sslcert::chainedcert deps: a slightly better solution [puppet] - https://gerrit.wikimedia.org/r/215839
[01:29:59] (CR) jenkins-bot: [V: -1] sslcert::chainedcert deps: a slightly better solution [puppet] - https://gerrit.wikimedia.org/r/215839 (owner: BBlack)
[01:30:20] (PS1) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - 
https://gerrit.wikimedia.org/r/215840
[01:30:51] (PS2) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831)
[01:30:54] (PS3) BBlack: sslcert::chainedcert deps: a slightly better solution [puppet] - https://gerrit.wikimedia.org/r/215839
[01:32:05] (PS3) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831)
[01:34:05] (PS4) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831)
[01:38:07] (PS5) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831)
[01:41:10] (PS6) Dzahn: apache generic_vhost: add SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831)
[01:43:49] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:45:29] (CR) Ori.livneh: apache generic_vhost: add SSLCertificateChainFile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/215840 (https://phabricator.wikimedia.org/T100831) (owner: Dzahn)
[01:58:44] (PS4) BBlack: sslcert::chainedcert deps: a better solution [puppet] - https://gerrit.wikimedia.org/r/215839
[01:59:57] (CR) BBlack: [C: 2] sslcert::chainedcert deps: a better solution [puppet] - https://gerrit.wikimedia.org/r/215839 (owner: BBlack)
[02:04:37] PROBLEM - puppet last run on cp1059 is CRITICAL puppet fail
[02:04:38] PROBLEM - puppet last run on cp1072 is CRITICAL puppet fail
[02:04:58] PROBLEM - puppet last run on cp1047 is CRITICAL puppet fail
[02:04:58] PROBLEM - puppet last run on uranium is CRITICAL puppet fail
[02:05:07] PROBLEM - puppet last run on cp3046 is CRITICAL puppet 
fail
[02:05:38] PROBLEM - puppet last run on cp3015 is CRITICAL puppet fail
[02:05:56] (PS1) BBlack: add path to chainedcert exec [puppet] - https://gerrit.wikimedia.org/r/215858
[02:06:08] PROBLEM - puppet last run on neptunium is CRITICAL puppet fail
[02:06:08] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail
[02:06:17] (CR) BBlack: [C: 2 V: 2] add path to chainedcert exec [puppet] - https://gerrit.wikimedia.org/r/215858 (owner: BBlack)
[02:06:18] PROBLEM - puppet last run on cp1070 is CRITICAL puppet fail
[02:06:28] PROBLEM - puppet last run on cp4006 is CRITICAL puppet fail
[02:06:28] PROBLEM - puppet last run on cp1049 is CRITICAL puppet fail
[02:06:29] PROBLEM - puppet last run on magnesium is CRITICAL puppet fail
[02:06:34] those are all because of what's fixed in the merge above :(
[02:06:38] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail
[02:07:08] PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail
[02:07:27] PROBLEM - puppet last run on cp3012 is CRITICAL puppet fail
[02:11:44] (PS1) BBlack: chainedcert deps: one more bugfix to prev commits [puppet] - https://gerrit.wikimedia.org/r/215860
[02:12:01] (CR) BBlack: [C: 2 V: 2] chainedcert deps: one more bugfix to prev commits [puppet] - https://gerrit.wikimedia.org/r/215860 (owner: BBlack)
[02:20:28] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0]
[02:21:28] RECOVERY - puppet last run on cp1059 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[02:21:37] RECOVERY - puppet last run on cp1072 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[02:21:47] RECOVERY - puppet last run on magnesium is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[02:21:57] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:21:58] RECOVERY - puppet last run on cp3046 
is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[02:22:38] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:22:58] RECOVERY - puppet last run on neptunium is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[02:23:08] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[02:23:18] RECOVERY - puppet last run on cp1070 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:23:27] RECOVERY - puppet last run on cp4006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:23:27] RECOVERY - puppet last run on cp1049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:23:38] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[02:23:58] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:24:18] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[02:25:03] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 22s)
[02:25:20] Logged the message, Master
[02:25:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[02:25:58] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[02:25:59] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:26:18] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:26:28] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:26:48] RECOVERY - puppet last run 
on cp4004 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[02:26:48] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[02:26:48] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:27:07] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:27:08] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[02:27:18] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[02:27:38] RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:27:38] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:27:38] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[02:27:39] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:27:47] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[02:27:47] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:27:47] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:27:47] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:27:58] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[02:28:08] RECOVERY - puppet last run on virt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:28] RECOVERY - puppet last run on cp4001 is OK Puppet is currently 
enabled, last run 51 seconds ago with 0 failures
[02:28:28] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:28:38] RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:38] RECOVERY - puppet last run on virt1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:48] RECOVERY - puppet last run on antimony is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:57] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[02:28:57] RECOVERY - puppet last run on cp1050 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[02:28:58] RECOVERY - puppet last run on nembus is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:58] RECOVERY - puppet last run on cp1063 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[02:28:58] RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:28:58] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[02:28:58] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[02:29:28] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[02:29:28] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:29:28] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:29:38] RECOVERY - puppet last run on plutonium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:29:38] RECOVERY - puppet last run on cp1046 is OK Puppet is currently enabled, last run 1 minute ago 
with 0 failures
[02:29:57] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-04 02:28:54+00:00
[02:30:04] Logged the message, Master
[02:30:18] RECOVERY - puppet last run on cp1062 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[02:30:58] RECOVERY - puppet last run on cp1048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:31:07] RECOVERY - puppet last run on titanium is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:31:08] RECOVERY - puppet last run on rcs1002 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[02:31:29] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[02:31:48] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:32:18] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:32:37] RECOVERY - puppet last run on rcs1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:32:48] RECOVERY - puppet last run on cp1060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:33:29] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[02:34:28] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:34:57] if someone comes by and sees the wall of spam: everything's currently ok as far as we know, no real outages. just stupid things that were corrected
[02:35:27] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[02:36:08] (and the recent 5xx spike in reqerror is unrelated, was a poor attempt at DoS or a broken tool, but either way it's blocked now)
[02:36:45] oh, they're back again heh
[02:37:40] blocked again!
[02:47:09] (and again)
[02:47:11] ...
[03:09:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:56:38] (PS1) KartikMistry: CX: Added staff-recommender campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/215865 (https://phabricator.wikimedia.org/T101353)
[04:18:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[04:44:39] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (101066s 100000s)
[04:45:38] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:55:38] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:12:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 4 05:11:32 UTC 2015 (duration 11m 31s)
[05:12:42] Logged the message, Master
[05:19:17] PROBLEM - puppet last run on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:48:18] (CR) Santhosh: [C: 1] CX: Added staff-recommender campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/215865 (https://phabricator.wikimedia.org/T101353) (owner: KartikMistry)
[06:03:47] (PS25) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265)
[06:17:59] PROBLEM - puppet last run on praseodymium is CRITICAL Puppet last ran 4 hours ago
[06:18:38] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (18992 100000s)
[06:29:38] PROBLEM - puppet last run on mw1126 is CRITICAL puppet fail
[06:30:18] PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures
[06:32:18] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:34:08] PROBLEM - puppet last run on db2018 is CRITICAL Puppet has 1 failures
[06:34:48] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:34:57] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:34:59] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 1 failures
[06:35:02] (PS3) ArielGlenn: Add wb_changes_subscription table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: Aude)
[06:35:17] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:35:17] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 2 failures
[06:35:28] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures
[06:35:29] PROBLEM - puppet last run on mw1228 is CRITICAL Puppet has 1 failures
[06:35:47] PROBLEM - puppet last run on mw1189 is CRITICAL Puppet has 1 failures
[06:35:48] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:35:49] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:35:49] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:37:03] (PS1) Giuseppe Lavagetto: Adding patch for CVE-2015-3413 (post-embargo commit) [debs/hhvm] - https://gerrit.wikimedia.org/r/215866
[06:37:09] (PS4) ArielGlenn: Add wb_changes_subscription table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: Aude)
[06:41:41] (CR) ArielGlenn: "patch set 3 was rebase, 4 was fixing up the job name. all table dumps should have a job name ending in 'table'." [dumps] (ariel) - https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: Aude)
[06:42:31] (CR) ArielGlenn: [C: 2] Add wb_changes_subscription table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: Aude)
[06:43:59] (CR) Muehlenhoff: [C: 1] Adding patch for CVE-2015-3413 (post-embargo commit) [debs/hhvm] - https://gerrit.wikimedia.org/r/215866 (owner: Giuseppe Lavagetto)
[06:45:01] (PS2) Giuseppe Lavagetto: Adding patch for CVE-2015-3413 (post-embargo commit) [debs/hhvm] - https://gerrit.wikimedia.org/r/215866
[06:45:27] RECOVERY - puppet last run on rhodium is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:28] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:38] RECOVERY - puppet last run on mw1228 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:57] RECOVERY - puppet last run on mw1189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:48] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:07] RECOVERY - puppet last run on 
mw2184 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:47:07] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:47:25] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Adding patch for CVE-2015-3413 (post-embargo commit) [debs/hhvm] - https://gerrit.wikimedia.org/r/215866 (owner: Giuseppe Lavagetto)
[06:47:28] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on db2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:08] RECOVERY - puppet last run on mw1126 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:48:27] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:46] (PS4) ArielGlenn: Add wbc_entity_usage table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: Aude)
[06:53:52] (PS5) ArielGlenn: Add wbc_entity_usage table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: Aude)
[07:00:54] (CR) ArielGlenn: [C: 2] Add wbc_entity_usage table to xml dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: Aude)
[07:38:17] (PS2) ArielGlenn: Per bug #48012. 
Compressed possible errors into for loop; program exits if there is one or more errors and writes all of them. [dumps] (ariel) - https://gerrit.wikimedia.org/r/63782 (owner: Sanja pavlovic)
[07:38:48] RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:40:47] (CR) ArielGlenn: "rebased." [dumps] (ariel) - https://gerrit.wikimedia.org/r/63782 (owner: Sanja pavlovic)
[07:46:11] (CR) Alexandros Kosiaris: [C: -2] "It is still being used in opendj unfortunately. Please do not remove." [puppet] - https://gerrit.wikimedia.org/r/215821 (owner: Faidon Liambotis)
[07:47:21] (CR) Alexandros Kosiaris: [C: -2] "I already gave the dependent change a -2, giving -2 to this one as well." [puppet] - https://gerrit.wikimedia.org/r/215829 (owner: Faidon Liambotis)
[07:48:17] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:50:58] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 3.67 ms
[07:57:15] (CR) Alexandros Kosiaris: [C: 1] install-server: create placeholder LV to work around partman-lvm bug [puppet] - https://gerrit.wikimedia.org/r/215806 (https://phabricator.wikimedia.org/T100636) (owner: Filippo Giunchedi)
[08:08:27] mobrovac: hi. did you find anything with cxserver logstash on beta?
[08:09:26] (PS1) Jcrespo: Resolving grant issue on parsercache [puppet] - https://gerrit.wikimedia.org/r/215869 (https://phabricator.wikimedia.org/T101182)
[08:10:07] (CR) jenkins-bot: [V: -1] Resolving grant issue on parsercache [puppet] - https://gerrit.wikimedia.org/r/215869 (https://phabricator.wikimedia.org/T101182) (owner: Jcrespo)
[08:13:19] kart_: looking into it now
[08:18:36] (PS2) Jcrespo: Resolving grant issue on parsercache [puppet] - https://gerrit.wikimedia.org/r/215869 (https://phabricator.wikimedia.org/T101182)
[08:19:25] (CR) Ori.livneh: wmflib: Make ipresolve throw an error if it can't resolve (1 comment) [puppet] - https://gerrit.wikimedia.org/r/215682 (https://phabricator.wikimedia.org/T99833) (owner: Yuvipanda)
[08:22:56] kart_: the last log entry in /var/log/cxserver/main.log @ deployment-cxserver03 is from yesterday afternoon and is a restart sequence
[08:23:21] kart_: so presumably it's not possible to see something in logstash if no logs have been emitted since
[08:27:19] mobrovac: let me generate something...
[08:27:30] mobrovac: we can see info logs right?
[08:27:43] kart_: no, warnings and errors only
[08:27:46] so do something nasty
[08:27:47] :)
[08:29:36] sudo: ldap_start_tls_s(): Connect error
[08:29:43] that's bad
[08:29:58] kart_: yup, ldap is currently not working in deployment-prep
[08:30:12] waiting on yuvi to come online for that one
[08:30:19] ah
[08:31:34] i know andrewbogott_afk and thcipriani|afk have been talking about that yesterday, so we may have to wait even longer :)
[08:34:55] mobrovac: ok. So, I can mess up there without root :)
[08:35:10] can't!!!
[08:35:19] you cannot restart the service [08:35:46] yes [08:37:09] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 1 failures [08:37:48] <_joe_> mh I can become root on deployment-salt now [08:38:14] <_joe_> I'm gonna see how I can fix this, it will take me some time though [08:38:50] ok, this seems to be local to deployment-cxserver03 [08:39:23] tried a couple of other machines from deployment-prep, and i can become root without issues [08:42:03] _joe_: maybe give to deployment-cxserver03 the usual windows medicine ? [08:42:17] <_joe_> mobrovac: oh I see [08:42:29] <_joe_> mobrovac: yeah, good idea. You can do it by yourself btw [08:42:30] <_joe_> :) [08:42:55] ok, doing it [08:46:28] PROBLEM - salt-minion processes on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [08:54:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "the syncer tool works, I will add what's missing in subsequent commits." [software/conftool] - 10https://gerrit.wikimedia.org/r/215654 (owner: 10Giuseppe Lavagetto) [08:56:03] (03CR) 10Muehlenhoff: "I'm leaning towards using the current shell approach:" [puppet] - 10https://gerrit.wikimedia.org/r/211688 (https://phabricator.wikimedia.org/T100773) (owner: 10Muehlenhoff) [09:02:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [09:03:57] (03CR) 10Faidon Liambotis: "Where? :)" [puppet] - 10https://gerrit.wikimedia.org/r/215821 (owner: 10Faidon Liambotis) [09:06:51] PROBLEM - puppet last run on mw2213 is CRITICAL puppet fail [09:09:31] mobrovac: rebooted? [09:09:48] mobrovac: in that case, we should see 'something' in dashboard, right? [09:10:19] kart_: yes, but i can't log in now, so i don't even know if the service is up or not [09:10:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [09:14:12] mobrovac: running. 
[09:14:17] mobrovac: http://cxserver-beta.wmflabs.org/ [09:16:20] kart_: i'm looking at logstash logs, but i can see are ldap failures [09:20:41] That's strange [09:25:40] RECOVERY - puppet last run on mw2213 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:34:44] kart_: ArchaeologistPan should have a look at it soon [09:34:52] (re ldap issues) [09:35:01] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:43:46] mobrovac: ok so I can login with root key [09:43:49] * ArchaeologistPan tries stuff [09:45:37] ArchaeologistPan: it's rather peculiar that other deployment-prep instances do not suffer from this [09:46:14] mobrovac: they do! puppet doesn't work anywhere because ldap still fails on deployment-salt [09:46:26] I wonder if this is just a thing affecting all self hosted puppetmasters, actually [09:46:29] * ArchaeologistPan tests that [09:47:43] i recall andrewbogott_afk and thcipriani|afk talking about that a couple of days ago and coming up with a plan for migrating properly [09:47:55] (03CR) 10Alexandros Kosiaris: "neptunium/nembus admin-truststore and truststore. Should be really easy to remove it from them. Seems like the ads-truststore does not use" [puppet] - 10https://gerrit.wikimedia.org/r/215821 (owner: 10Faidon Liambotis) [09:48:02] yup, puppet failure on deployment-salt [09:49:03] mobrovac: yup, and they migrated it yesterday I think [09:53:22] paravoid: around? [09:53:23] YuviPanda: you pinged me? [09:53:48] YuviPanda: re: language-dev, feel free to do whatever you want :) [09:54:39] kart_: ah cool :) [09:54:58] kart_: btw, in the meantime - since I *can* ssh to cxserver03, anything you want me to do to unblock you? [09:56:51] PROBLEM - puppet last run on labvirt1008 is CRITICAL puppet fail [10:03:07] YuviPanda: check /var/log/cxserver/main.log for today's entries, please [10:03:28] mobrovac: I think I fixed the underlying issue. 
moment [10:03:37] ah cool [10:03:51] think it was caused by Ia7d0c047e7a542e2c8a9934e580505e79ff999b4 [10:05:22] mobrovac: kart_ can you file a bug so I can put details in it? [10:05:37] YuviPanda: yup, on it [10:05:38] kart_: and try logging in now? [10:05:53] YuviPanda: works now [10:06:03] cool :) [10:09:15] YuviPanda: https://phabricator.wikimedia.org/T101377 [10:09:25] YuviPanda: and thnx <3 [10:14:27] mobrovac: :) yw! I posed an explanation there [10:17:20] ah, outdated cert [10:21:35] mobrovac: outdated cert path, and puppetmaster died itself before it could correct that [10:21:40] mobrovac: thanks! [10:21:52] heh [10:22:04] YuviPanda: and thanks :) [10:22:09] the irony of the puppet master :) [10:22:16] yeah :) [10:22:27] meanwhile I restarted cxserver [10:22:57] kart_: could you please try to do something there that would trigger a warn or error log entry? [10:23:57] let me apply broken config [10:24:11] but that doesn't work either as it won't let log :/ [10:24:38] euh? [10:24:41] not sure i understand [10:29:25] mobrovac: restart should do that? [10:29:59] kart_: most likely, yes [10:31:22] done [10:31:26] but nothing yet [10:31:59] brb. tea() [10:35:40] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:40] mobrovac: nothing yet [10:42:48] mobrovac: also no restbase too [10:44:59] yup, that's rather strange [10:45:33] (03PS3) 10ArielGlenn: Per bug #48012. Compressed possible errors into for loop; program exits if there is one or more errors and writes all of them. [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63782 (owner: 10Sanja pavlovic) [10:46:57] (03CR) 10ArielGlenn: [C: 032] Per bug #48012. Compressed possible errors into for loop; program exits if there is one or more errors and writes all of them. [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63782 (owner: 10Sanja pavlovic) [10:50:53] (03Abandoned) 10ArielGlenn: Per bug #48012. Patch for worker.py. 
It checks for external programs existence in the initialization part. [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/64095 (owner: 10Sanja pavlovic) [10:51:31] (03Abandoned) 10ArielGlenn: Check for external programs existence in worker.py initialization [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63390 (https://bugzilla.wikimedia.org/48012) (owner: 10Sanja pavlovic) [11:03:50] (03CR) 10Hoo man: [C: 04-1] rsync wikidata json dumps to labs /public/dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [11:05:16] (03CR) 10Addshore: rsync wikidata json dumps to labs /public/dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [11:08:23] (03PS2) 10ArielGlenn: Dump GeoData information [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/156450 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [11:12:58] (03PS3) 10ArielGlenn: Dump GeoData information [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/156450 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [11:14:32] (03CR) 10TTO: "A win for long term memory! Nice to see old patch sets getting some love." [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63782 (owner: 10Sanja pavlovic) [11:15:33] (03CR) 10ArielGlenn: [C: 032] Dump GeoData information [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/156450 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [11:31:53] Wow, fatalmonitor shows 3500 dberror fatals in the last 5 minutes. 
[11:32:00] Consistently so over the past few days [11:32:06] that cant be good [12:13:11] Krinkle, T98489 [12:16:31] (03PS1) 10Springle: repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215890 [12:17:02] (03CR) 10Springle: [C: 032] repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215890 (owner: 10Springle) [12:18:17] (03PS1) 10Giuseppe Lavagetto: conftool: adding the cli-tool [software/conftool] - 10https://gerrit.wikimedia.org/r/215891 [12:18:33] (03Merged) 10jenkins-bot: repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215890 (owner: 10Springle) [12:19:34] !log springle Synchronized wmf-config/db-eqiad.php: repool db1072, warm up (duration: 00m 13s) [12:19:40] Logged the message, Master [12:23:37] (03PS26) 10Mobrovac: CX: Log to logstash [puppet] - 10https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: 10KartikMistry) [12:26:23] kart_: fixed [12:26:26] finally :) [12:26:30] (03CR) 10Ofus: [C: 031] tin: set cluster in hiera, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/210835 (owner: 10Dzahn) [12:38:20] (03PS1) 10Mobrovac: Use new Labs DNS scheme in deployment-prep for services [puppet] - 10https://gerrit.wikimedia.org/r/215896 [12:54:16] brb [13:07:42] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:42] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:42] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:42] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:42] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:51] PROBLEM - statsite backend instances on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:07:52] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:01] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:11] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds [13:08:32] PROBLEM - statsdlb process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:32] PROBLEM - configured eth on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:41] PROBLEM - dhclient process on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:20] ^host is up, but ssh and nrpe are down [13:16:00] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:16:02] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning. [13:16:02] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 17 minutes ago with 0 failures [13:16:02] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning. [13:16:02] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0 [13:16:20] RECOVERY - DPKG on graphite2001 is OK: All packages OK [13:16:21] RECOVERY - Disk space on graphite2001 is OK: DISK OK [13:16:31] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [13:16:43] ^not me [13:21:22] PROBLEM - salt-minion processes on graphite2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:21:22] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:22] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:27] (03CR) 10Mobrovac: "Tested in deployment-prep, works as advertised." 
[puppet] - 10https://gerrit.wikimedia.org/r/215896 (owner: 10Mobrovac) [13:21:31] PROBLEM - puppet last run on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:31] PROBLEM - RAID on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:33] PROBLEM - DPKG on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:42] PROBLEM - Disk space on graphite2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:51] PROBLEM - SSH on graphite2001 is CRITICAL - Socket timeout after 10 seconds [13:22:52] RECOVERY - salt-minion processes on graphite2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:22:52] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are runnning. [13:22:52] RECOVERY - uWSGI web apps on graphite2001 is OK All defined uWSGI apps are runnning. [13:23:01] RECOVERY - RAID on graphite2001 is OK Active: 8, Working: 8, Failed: 0, Spare: 0 [13:23:01] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures [13:23:02] RECOVERY - statsite backend instances on graphite2001 is OK All defined statsite jobs are runnning. [13:23:02] RECOVERY - DPKG on graphite2001 is OK: All packages OK [13:23:15] (03CR) 10Andrew Bogott: [C: 031] "This is fine. 
Note that the old-school FQDNs will continue to work with the new DNS setup; so this patch is a nice security improvement " [puppet] - 10https://gerrit.wikimedia.org/r/215896 (owner: 10Mobrovac) [13:23:20] RECOVERY - Disk space on graphite2001 is OK: DISK OK [13:23:21] RECOVERY - SSH on graphite2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [13:23:50] RECOVERY - statsdlb process on graphite2001 is OK: PROCS OK: 1 process with command name statsdlb [13:23:50] RECOVERY - configured eth on graphite2001 is OK - interfaces up [13:23:51] RECOVERY - dhclient process on graphite2001 is OK: PROCS OK: 0 processes with command name dhclient [13:25:31] load average: 8.77, 20.98, 21.12 [13:26:26] (03PS1) 10Anomie: Remove obsolete 'ValidateExtendedMetadataCache' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215900 [13:29:24] (03PS1) 10Anomie: Enable ApiFeatureUsage everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215901 (https://phabricator.wikimedia.org/T1272) [13:31:27] (03PS5) 10Andrew Bogott: For self-hosted puppet, require simple puppetmaster name. [puppet] - 10https://gerrit.wikimedia.org/r/215333 [13:32:37] (03CR) 10Andrew Bogott: [C: 032] For self-hosted puppet, require simple puppetmaster name. 
[puppet] - 10https://gerrit.wikimedia.org/r/215333 (owner: 10Andrew Bogott) [13:33:41] (03CR) 10Anomie: CX: Add wikis for deployment on 20150406 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215281 (https://phabricator.wikimedia.org/T100622) (owner: 10KartikMistry) [13:33:43] (03PS2) 10Andrew Bogott: Use new Labs DNS scheme in deployment-prep for services [puppet] - 10https://gerrit.wikimedia.org/r/215896 (owner: 10Mobrovac) [13:34:54] (03CR) 10Andrew Bogott: [C: 032] Use new Labs DNS scheme in deployment-prep for services [puppet] - 10https://gerrit.wikimedia.org/r/215896 (owner: 10Mobrovac) [13:35:39] andrewbogott: cheers [13:38:06] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [13:38:06] @info 10.64.16.22 [13:38:06] Krinkle: [10.64.16.22: s7] db1033 [13:38:38] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [13:38:38] @info d5 [13:38:38] Krinkle: Unknown identifier (d5) [13:38:41] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [13:38:41] @info s5 [13:38:42] Krinkle: [s5] db1058: 10.64.32.28, db1049: 10.64.16.144, db1045: 10.64.16.34, db1026: 10.64.16.15, db1070: 10.64.48.25, db1071: 10.64.48.26 [13:38:46] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [13:38:46] @info db1058 [13:38:46] Krinkle: [db1058: s5] 10.64.32.28 [13:38:49] @replag db1058 [13:38:49] Krinkle: Could not get replag information. 
[13:38:53] @replag s2 [13:38:53] Krinkle: [s2: zhwiki] db1024: 0s, db1021: 0s, db1036: 0s, db1018: 0s, db1054: 0s, db1060: 0s, db1063: 0s, db1067: 0s [13:39:53] @docs [13:39:54] Krinkle: https://www.mediawiki.org/wiki/dbbot-wm [13:43:37] @docs [13:43:37] Krinkle: https://www.mediawiki.org/wiki/dbbot-wm - https://github.com/Krinkle/wmfDbBot#commands [13:44:21] (03CR) 10Giuseppe Lavagetto: [C: 031] ipresolve: Make passing in type of record optional [puppet] - 10https://gerrit.wikimedia.org/r/215884 (owner: 10Yuvipanda) [13:44:37] (03Abandoned) 10Giuseppe Lavagetto: Conftool: initial commit [software/conftool] - 10https://gerrit.wikimedia.org/r/215604 (owner: 10Giuseppe Lavagetto) [13:45:59] (03CR) 10Giuseppe Lavagetto: [C: 031] wmflib: Make ipresolve throw an error if it can't resolve [puppet] - 10https://gerrit.wikimedia.org/r/215682 (https://phabricator.wikimedia.org/T99833) (owner: 10Yuvipanda) [13:49:37] (03PS1) 10Giuseppe Lavagetto: build-dep: add gawk [debs/hhvm] - 10https://gerrit.wikimedia.org/r/215903 [13:50:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] build-dep: add gawk [debs/hhvm] - 10https://gerrit.wikimedia.org/r/215903 (owner: 10Giuseppe Lavagetto) [13:50:11] PROBLEM - Translation cache space on mw1025 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:51:50] RECOVERY - Translation cache space on mw1025 is OK: HHVM_TC_SPACE OK TC sizes are OK [13:52:34] 503's from TC restarts? [13:53:02] bunch of nonpaging CRITs in icinga like: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.479 second response time [13:53:12] for various mwNNNN [13:53:34] <_joe_> bblack: I didn't restart it, I was looking at mw1025 and it had a segfault just now [13:53:50] I mean maybe they self-restarted, I donno. 
[13:54:05] they only failed for 1/3 checks, but it was a fair chunk of them [13:54:21] <_joe_> yeah probably they failed because of that [13:54:27] <_joe_> I'm taking a look [13:55:09] I guess TC space is less an issue if we were on repoauth or something? [13:55:25] <_joe_> bblack: I don't think this has to do with TC [13:55:39] well there were TC warnings leading up to it, so I figured it was related [13:56:16] <_joe_> bblack: I wasn't looking, tbh [13:57:59] _joe_: https://phabricator.wikimedia.org/P728 [13:58:13] it's noisy, it's just they all flit into fail-ok-fail fast enough to not hit IRC [13:58:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [13:59:33] <_joe_> bblack: I don't see real errors on the TC space, apart from not being reachable because HHVM crashed [13:59:43] yeah [14:01:31] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:02:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [14:08:02] PROBLEM - Translation cache space on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:41] RECOVERY - Translation cache space on mw1020 is OK: HHVM_TC_SPACE OK TC sizes are OK [14:13:41] (03PS1) 10BBlack: disable text-backend retry503 behavior [puppet] - 10https://gerrit.wikimedia.org/r/215910 [14:14:27] this is segfaulting on edit https://zh.wikipedia.org/wiki/%E9%80%B1%E5%85%AD%E5%A4%9C%E7%8F%BE%E5%A0%B4 [14:15:25] (03CR) 10BBlack: [C: 032] disable text-backend retry503 behavior [puppet] - 10https://gerrit.wikimedia.org/r/215910 (owner: 10BBlack) [14:19:01] (03CR) 10Alexandros Kosiaris: [C: 032] "I 've removed wmf-ca from nembus/neptunium admin-truststore and truststore. 
+2 from me now" [puppet] - 10https://gerrit.wikimedia.org/r/215821 (owner: 10Faidon Liambotis) [14:19:09] (03CR) 10Alexandros Kosiaris: [C: 032] base: remove wmf-ca ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/215829 (owner: 10Faidon Liambotis) [14:20:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [14:21:36] (03PS1) 10Andrew Bogott: Mark out some obsolete passenger settings on Trusty and Jessie. [puppet] - 10https://gerrit.wikimedia.org/r/215911 [14:24:41] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: Connection refused [14:26:31] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [14:29:41] PROBLEM - puppet last run on mw1254 is CRITICAL Puppet has 1 failures [14:29:56] anyone around with google webmaster tools access? [14:33:20] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 287 bytes in 1.230 second response time [14:35:32] (03PS1) 10BBlack: temporarily block bingbot from zhwiki [puppet] - 10https://gerrit.wikimedia.org/r/215914 [14:36:45] (03PS2) 10BBlack: temporarily block bingbot from zhwiki [puppet] - 10https://gerrit.wikimedia.org/r/215914 [14:37:20] (03CR) 10Giuseppe Lavagetto: [C: 031] temporarily block bingbot from zhwiki [puppet] - 10https://gerrit.wikimedia.org/r/215914 (owner: 10BBlack) [14:40:54] <_joe_> kart_, Nikerabbit is there a way to disable automatic translations in zh.wikipedia.org/zh-tw/.. or similar? [14:41:21] _joe_: do you mean language coverter? [14:41:28] <_joe_> Nikerabbit: yep [14:42:11] !log running sudo sed -i 's/GlobalSign_CA.pem/ca-certificates.crt/' /etc/ldap/ldap.conf on all labs nodes [14:42:19] Logged the message, Master [14:43:12] Language converter... 
I don't really know [14:43:32] there is preference for content language variant [14:43:52] might as well just disable the wiki :) [14:44:54] there's still a fair amount of legit non-CN traffic there :P [14:45:20] RECOVERY - puppet last run on mw1254 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:48] <_joe_> bblack: let's disable bing at least, it may help us to see what's left anyways [14:46:05] there's some from google/yahoo I could include as well, but bing dwarfs them [14:46:10] ok [14:46:45] (03CR) 10BBlack: [C: 032] temporarily block bingbot from zhwiki [puppet] - 10https://gerrit.wikimedia.org/r/215914 (owner: 10BBlack) [14:50:58] kart_, legoktm: Ping for SWAT in about 10 minutes [14:51:49] anomie: ack [14:51:56] (03PS2) 10KartikMistry: CX: Add wikis for deployment on 20150406 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215281 (https://phabricator.wikimedia.org/T100622) [14:52:22] anomie: sorted list now :) [14:52:33] jouncebot: next [14:52:33] In 0 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150604T1500) [14:52:34] good! [14:55:20] <_joe_> so, disabling language converter for zhwiki is not an option? [14:55:26] anomie: pong [14:57:51] might as well just disable the wiki :) [14:58:16] can we delay SWAT a bit? we're still dealing with some issues here [14:58:45] legoktm, kart_: SWAT may be delayed or canceled this morning. Will update. [14:58:49] (03PS2) 10KartikMistry: CX: Add wikis for deployment on 20150406 [puppet] - 10https://gerrit.wikimedia.org/r/215282 (https://phabricator.wikimedia.org/T100622) [14:58:52] alright [14:59:00] * legoktm reads up [14:59:39] anomie: oops :/ [15:00:01] bblack: how much time it will take? 
[15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, kart_, legoktm, anomie: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150604T1500). Please do the needful. [15:00:36] unknown. we're dealing with applayer crasher that started up 1h15m ago triggered by external traffic [15:00:57] as in hhvm-crasher, not just php-exception [15:01:10] (03PS1) 10coren: Tool Labs: add old-style fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) [15:01:20] bblack: thanks for update! [15:03:19] bblack: I should also stop updating code too, right? [15:03:54] please, for a bit. mostly we just don't want any secondary unrelated issue clouding things [15:04:42] okay! [15:08:10] PROBLEM - Apache HTTP on mw1182 is CRITICAL - Socket timeout after 10 seconds [15:08:30] PROBLEM - HHVM rendering on mw1182 is CRITICAL - Socket timeout after 10 seconds [15:09:31] <_joe_> !log puppet disabled, fss disabled on mw1017 [15:09:36] Logged the message, Master [15:11:13] (03PS1) 10Faidon Liambotis: autoinstall: workaround silly d-i/tasksel bug [puppet] - 10https://gerrit.wikimedia.org/r/215923 [15:12:49] !log ori Synchronized php-1.26wmf8/includes/libs/ReplacementArray.php: Ia5f3dc84605: awful hack: disable fss on zhwiki only, except on mw1017 (duration: 00m 17s) [15:12:54] Logged the message, Master [15:13:22] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [15:13:46] 5xxs are declining actually [15:16:31] (03PS1) 10BBlack: Revert "temporarily block bingbot from zhwiki" [puppet] - 10https://gerrit.wikimedia.org/r/215925 [15:16:57] (03CR) 10BBlack: [C: 032 V: 032] Revert "temporarily block bingbot from zhwiki" [puppet] - 10https://gerrit.wikimedia.org/r/215925 (owner: 10BBlack) [15:19:56] _joe_, bblack: Let me know if/when we can restart SWAT [15:20:56] (03PS2) 10Yuvipanda: tools: add old-style 
fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) (owner: 10coren) [15:22:56] anomie: kart_ others: swat still on hold for at least 10 minutes while we let things settle [15:23:07] see: https://gdash.wikimedia.org/dashboards/reqerror/ [15:23:38] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:23:51] (03PS14) 10Andrew Bogott: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 [15:24:43] greg-g: sure! [15:25:47] PROBLEM - DPKG on etherpad1001 is CRITICAL: Connection refused by host [15:25:58] PROBLEM - Disk space on etherpad1001 is CRITICAL: Connection refused by host [15:26:00] (03CR) 10BBlack: "Seems like the if/else there is duplicated in many files, can we move it to one place in the name of DRY somehow?" [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [15:26:32] (03CR) 10Yuvipanda: [C: 04-1] tools: add old-style fqdn aliases to nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) (owner: 10coren) [15:26:37] PROBLEM - RAID on etherpad1001 is CRITICAL: Connection refused by host [15:26:48] PROBLEM - configured eth on etherpad1001 is CRITICAL: Connection refused by host [15:27:03] kart_: anomie ok, the reqerrors graph is starting to look normalized, you're free to sync-dir/file/whatever at 15:30 [15:27:07] PROBLEM - dhclient process on etherpad1001 is CRITICAL: Connection refused by host [15:27:17] PROBLEM - puppet last run on etherpad1001 is CRITICAL: Connection refused by host [15:27:27] PROBLEM - salt-minion processes on etherpad1001 is CRITICAL: Connection refused by host [15:27:36] etherpad?! 
[15:27:40] shushhh [15:27:41] don't worry [15:27:45] it's alll okkkk [15:27:47] re SWAT: I'll do kart_'s first, then mine, then legoktm's because that's likely to take longest. [15:27:50] YuviPanda: :) [15:28:00] greg-g: it's the new misc VM cluster akosiari.s is playing with [15:28:07] fun fun [15:28:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:29:36] (03CR) 10coren: tools: add old-style fqdn aliases to nodes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) (owner: 10coren) [15:30:01] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215281 (https://phabricator.wikimedia.org/T100622) (owner: 10KartikMistry) [15:30:08] (03Merged) 10jenkins-bot: CX: Add wikis for deployment on 20150406 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215281 (https://phabricator.wikimedia.org/T100622) (owner: 10KartikMistry) [15:30:31] (03CR) 10Yuvipanda: tools: add old-style fqdn aliases to nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) (owner: 10coren) [15:30:33] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Add wikis for deployment on 20150406 [[gerrit:215281]] (duration: 00m 12s) [15:30:34] kart_: ^ Test please [15:30:38] Logged the message, Master [15:30:47] (03PS2) 10Anomie: CX: Added staff-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215865 (https://phabricator.wikimedia.org/T101353) (owner: 10KartikMistry) [15:33:26] (03CR) 10Giuseppe Lavagetto: "I want to reiterate, I'm not against writing this in , I'm against reinventing the wheel. And I feel like we'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [15:33:44] anomie: ack [15:34:07] kart_: Is that "ack" as in "testing now", or "works good"? [15:34:28] anomie: cx deployed. 
so good. [15:34:43] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215865 (https://phabricator.wikimedia.org/T101353) (owner: 10KartikMistry) [15:34:49] (03Merged) 10jenkins-bot: CX: Added staff-recommender campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215865 (https://phabricator.wikimedia.org/T101353) (owner: 10KartikMistry) [15:34:50] akosiaris: or godog can you merge, https://gerrit.wikimedia.org/r/#/c/215282/ [15:34:54] (03PS2) 10Faidon Liambotis: autoinstall: fix preseeding of tasksel/first [puppet] - 10https://gerrit.wikimedia.org/r/215923 [15:35:14] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Added staff-recommender campaign [[gerrit:215865]] (duration: 00m 12s) [15:35:15] kart_: ^ Test please [15:35:20] Logged the message, Master [15:35:35] anomie: nothing visible to test for staff-recommender, [15:35:37] Thanks! [15:35:53] (03PS2) 10Anomie: Remove obsolete 'ValidateExtendedMetadataCache' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215900 [15:36:02] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215900 (owner: 10Anomie) [15:36:08] (03Merged) 10jenkins-bot: Remove obsolete 'ValidateExtendedMetadataCache' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215900 (owner: 10Anomie) [15:36:15] (03PS3) 10Filippo Giunchedi: CX: Add wikis for deployment on 20150406 [puppet] - 10https://gerrit.wikimedia.org/r/215282 (https://phabricator.wikimedia.org/T100622) (owner: 10KartikMistry) [15:36:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] CX: Add wikis for deployment on 20150406 [puppet] - 10https://gerrit.wikimedia.org/r/215282 (https://phabricator.wikimedia.org/T100622) (owner: 10KartikMistry) [15:36:36] kart_: 20150406 ?? 
[15:36:52] !log anomie Synchronized wmf-config/CommonSettings.php: SWAT: Remove obsolete 'ValidateExtendedMetadataCache' hook [[gerrit:215900]] (duration: 00m 12s) [15:36:59] kart_: merged, even with the US date :P [15:36:59] Logged the message, Master [15:37:17] akosiaris: I'm from past :/ [15:37:33] godog: akosiaris no worries. but I'll take care. [15:37:37] anomie: Exception from T97469 isn't showing up in exception.log, so we're probably good. [15:37:49] (03PS2) 10Anomie: Enable ApiFeatureUsage everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215901 (https://phabricator.wikimedia.org/T1272) [15:37:58] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215901 (https://phabricator.wikimedia.org/T1272) (owner: 10Anomie) [15:38:04] (03Merged) 10jenkins-bot: Enable ApiFeatureUsage everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215901 (https://phabricator.wikimedia.org/T1272) (owner: 10Anomie) [15:38:28] PROBLEM - Translation cache space on mw1119 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99% [15:38:38] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ApiFeatureUsage everywhere [[gerrit:215901]] (duration: 00m 19s) [15:38:43] Logged the message, Master [15:39:00] anomie: Works! [15:39:18] PROBLEM - Translation cache space on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:39:37] legoktm: Doing yours now, waiting on Jenkins. 
[15:40:12] anomie: every time you make me chuckle [15:40:18] RECOVERY - Translation cache space on mw1119 is OK: HHVM_TC_SPACE OK TC sizes are OK [15:40:39] greg-g: anomie is serious swatter [15:40:44] :) [15:40:51] indeed, the seriousest [15:40:54] (not a word) [15:41:04] * legoktm waits patiently [15:41:35] RECOVERY - Translation cache space on mw1116 is OK: HHVM_TC_SPACE OK TC sizes are OK [15:43:43] 6operations: Redis lua sandbox bypass - https://phabricator.wikimedia.org/T101397#1337796 (10MoritzMuehlenhoff) [15:44:06] 6operations: Redis lua sandbox bypass - https://phabricator.wikimedia.org/T101397#1337563 (10MoritzMuehlenhoff) A backported package for trusty has been built and installed on tool-labs-* [15:44:11] (03CR) 1020after4: "I evaluated quilt and it doesn't really fit for my simple use-case. I found some alternatives that would probably work but the most promis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [15:46:25] (03PS3) 10Yuvipanda: wmflib: Make ipresolve throw an error if it can't resolve [puppet] - 10https://gerrit.wikimedia.org/r/215682 (https://phabricator.wikimedia.org/T99833) [15:46:49] (03CR) 10Yuvipanda: [C: 032 V: 032] wmflib: Make ipresolve throw an error if it can't resolve [puppet] - 10https://gerrit.wikimedia.org/r/215682 (https://phabricator.wikimedia.org/T99833) (owner: 10Yuvipanda) [15:47:28] (03CR) 10Giuseppe Lavagetto: "Can you please elaborate a little on "doesn't fit my use-case"? 
Just so that I can get convinced :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [15:47:39] (03PS2) 10Yuvipanda: ipresolve: Make passing in type of record optional [puppet] - 10https://gerrit.wikimedia.org/r/215884 [15:48:43] !log anomie Synchronized php-1.26wmf8/includes/jobqueue/: SWAT: jobqueue: Record stats on how long it takes before a job is run [[gerrit:215748]] (duration: 00m 14s) [15:48:44] legoktm: ^ Test please [15:48:49] Logged the message, Master [15:49:15] (03CR) 10Yuvipanda: [C: 032] ipresolve: Make passing in type of record optional [puppet] - 10https://gerrit.wikimedia.org/r/215884 (owner: 10Yuvipanda) [15:49:23] _joe_: https://phabricator.wikimedia.org/T95375 aka the linked task :P [15:50:39] akosiaris: if you’re still working, can you join Coren and me in the mystery of trusty/passenger/ldap on labcontrol1001? [15:51:03] andrewbogott: gimme a couple of mins first, but yeah [15:51:23] * legoktm is fiddling with graphite [15:52:24] anomie: working, woot :D [15:52:32] * anomie is done with SWAT [15:54:24] (03CR) 10Yuvipanda: "I don't know if maintaining the ply package ourselves is that big an issue, esp. if we're ok working with upstream to fix any issues we mi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [15:54:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [15:54:53] !log added redis_2.8.4-2+wmf1 to trusty-wikimedia on apt.wikimedia.org [15:54:59] Logged the message, Master [15:57:55] (03PS4) 10Yuvipanda: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) [15:58:15] anyone wanna look at this somewhat small change to the ssh module? 
^ [15:58:20] don't want to break ssh and make everyone saaaad [15:58:24] s/somewhat// [15:59:06] (03CR) 1020after4: "Thing is, this simple script I wrote in php does the trick, and honestly using nothing at all would be easier than arguing about it which " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [15:59:29] (03PS5) 10Yuvipanda: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) [16:00:04] kart_: Respected human, time to deploy Content Translation Server (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150604T1600). Please do the needful. [16:00:19] chasemp: ^^ since you're on duty and I guess would be the most hurt if I broke ssh :) [16:00:38] should be a noop [16:01:20] looking [16:01:54] (03CR) 1020after4: "If we use someone else's solution I will still need to write quite a bit of wrapper code to make it apply to our unique code layout - spec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [16:02:02] YuviPanda: lgtm, fwiw :) [16:02:17] i'm starting to write like teenagers :P [16:02:56] jouncebot: yes [16:03:42] (03CR) 10Yuvipanda: "My personal feeling is that this is one of those things that might end up looking like pre-rewrite scap a few years down the line, so roll" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [16:03:48] mobrovac: :D [16:04:09] YuviPanda: should " $hba = $enable_hba or $::ssh_hba == 'yes'" [16:04:14] be == false [16:04:18] becuase yes is static in the template [16:04:25] I'm having trouble following the logic otherwise [16:04:42] seems maniphest should set a bool if you hard code it [16:04:44] chasemp: $hba is boolean, while $::ssh_hba is string [16:04:58] ah [16:05:01] and the only valid value for it earlier as 
'yes' anyway [16:05:01] coffee needed [16:05:09] so just codifies that [16:05:21] <_joe_> twentyafterfour: I was just curious, not fighting at all. I was actually trying to suggest what I felt was a cleaner way to manage all this, but it's /your/ workflow, that's why I didn't downvote it again. [16:05:34] +1 to _joe_ [16:05:59] (03CR) 10Legoktm: "a) The current solution sucks. I think we can all agree on that. b) we now have something that works, and means we're not going to acciden" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [16:06:01] (03PS1) 10Alexandros Kosiaris: autoinstall: Set default task for tasksel [puppet] - 10https://gerrit.wikimedia.org/r/215929 [16:07:02] chasemp: ah, cool :) [16:08:53] <_joe_> really, guys, stop with the animosity in comments on that patch [16:09:06] <_joe_> I asked one question, didn't even -1 it... [16:10:00] <_joe_> one question I asked 1 month ago, and got no answer at the time either, btw. but it's ok as long as no one else makes a drama about this. [16:10:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:10:40] _joe_: thanks, I really did give quilt a good look but it's a complex tool and it would take me as long to really learn quilt as it took to write that php script. [16:10:53] scope in puppet makes my head hurt sometimes [16:10:55] also, you weren't the only one to object to the php [16:11:01] YuviPanda: I'm just slow man I'm working it out [16:11:08] chasemp: lol "scope" [16:11:20] <_joe_> twentyafterfour: did I? 
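[editor's note] The exchange above about `$hba = $enable_hba or $::ssh_hba == 'yes'` can be sketched in Python. This is an illustration only, not the Puppet code from the patch: the function name `resolve_hba` is hypothetical, and it assumes the comparison is evaluated before the `or`, as described in the channel. The point is that the string is reduced to a boolean first, so a non-empty string such as 'no' never counts as enabled even though it is truthy on its own.

```python
from typing import Optional


def resolve_hba(enable_hba: bool, ssh_hba: Optional[str] = None) -> bool:
    """Combine a boolean flag with a legacy string flag (hypothetical helper).

    enable_hba: the boolean setting (from hiera in the discussion).
    ssh_hba:    the legacy string global; only the literal 'yes' enables it.
    """
    # Compare the string first so the result is always a real boolean,
    # then OR it with the boolean flag.
    legacy_enabled = (ssh_hba == "yes")
    return bool(enable_hba) or legacy_enabled


if __name__ == "__main__":
    for flag, legacy in [(False, "yes"), (False, "no"), (False, None), (True, "no")]:
        print(flag, repr(legacy), "->", resolve_hba(flag, legacy))
```

Under these assumptions only the boolean flag or the exact string 'yes' turns the feature on; every other legacy value leaves it off.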
[16:11:28] <_joe_> I don't think so :) [16:11:56] <_joe_> twentyafterfour: but re: quilt needs learning, fair enough [16:12:01] _joe_: I mean, object to the patch not php specifically [16:12:45] <_joe_> twentyafterfour: if/when I have time I'll try to show you how that can be done with quilt, we'll see if that's better/worse then [16:13:29] YuviPanda: for $::ssh_hba == 'yes' is the alternative to be set to 'no' or a non value [16:13:39] because mixing booleans and strings here seems weird [16:13:50] How does quilt deal with submodules? I couldn't find a single thing anywhere (my searching skills are pretty good but I din't look for very long, I may have missed it) [16:14:02] and won't 'no' evaulate just as well as 'yes' in the template case [16:15:19] <_joe_> twentyafterfour: quilt doesn't care about git, it can work on any filesystem, but maybe that's an issue for us? [16:15:51] chasemp: you should that as ($enable_hba) || ($ssh_hba == 'yes') so the outcome can be only a boolean [16:15:58] s/that/ read that/ [16:16:12] <_joe_> chasemp: why a top-scope variable? [16:16:15] <_joe_> who did that? [16:16:33] mobrovac: yeah...of course that's right [16:16:39] (03CR) 10BryanDavis: "I'm going to +2 this as soon as the php coding style issues are fixed. Having a tool today is 100% better than hoping someone will find an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202665 (https://phabricator.wikimedia.org/T95375) (owner: 1020after4) [16:16:58] _joe_: I'm working it out :) I think it's to simplify some labs edge cases and be able to utilize hiera [16:17:55] too many things at once [16:19:44] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [16:20:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] ssh: Make hba enable-able via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [16:22:21] _joe_: perhaps quilt would work in that case. 
There is some git specific stuff happening but it may not be entirely necessary. [16:22:45] I'll give it one more shot [16:23:55] !log kartik Started scap: Update ContentTranslation [16:24:01] Logged the message, Master [16:25:54] (03PS3) 10Alexandros Kosiaris: autoinstall: fix preseeding of tasksel/first [puppet] - 10https://gerrit.wikimedia.org/r/215923 (owner: 10Faidon Liambotis) [16:25:56] (03PS2) 10Alexandros Kosiaris: autoinstall: Set default task for tasksel [puppet] - 10https://gerrit.wikimedia.org/r/215929 [16:26:57] (03CR) 10Alexandros Kosiaris: [C: 032] autoinstall: fix preseeding of tasksel/first [puppet] - 10https://gerrit.wikimedia.org/r/215923 (owner: 10Faidon Liambotis) [16:27:20] (03CR) 10Alexandros Kosiaris: [C: 032] autoinstall: Set default task for tasksel [puppet] - 10https://gerrit.wikimedia.org/r/215929 (owner: 10Alexandros Kosiaris) [16:30:24] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1337970 (10Wwes) >>! In T98283#1337610, @Dzahn wrote: > Let's give Stu the missing 15 https sites and resolve this ticket, then discuss automating the process as a separate thing. 
and wes please [16:33:13] !log kartik Finished scap: Update ContentTranslation (duration: 09m 17s) [16:33:19] Logged the message, Master [16:34:18] (03PS1) 10Giuseppe Lavagetto: hhvm: actually set the timeout on normal appserver, restore on canaries [puppet] - 10https://gerrit.wikimedia.org/r/215931 (https://phabricator.wikimedia.org/T98489) [16:34:22] <_joe_> bd808: ^^ [16:34:33] * bd808 looks [16:34:45] <_joe_> so, someone removed the def from the canaries, and on appservers we just used the wrong key [16:35:23] *nod* [16:35:40] there is tailing whitespace in hieradata/role/common/mediawiki/canary_appserver.yaml [16:35:48] *trailing [16:38:33] <_joe_> bd808: yeah, correcting that too [16:39:21] (03PS1) 10Jforrester: Hovercards: Disable test release on Catalan and Greek Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215932 (https://phabricator.wikimedia.org/T92555) [16:39:33] (03PS2) 10Giuseppe Lavagetto: hhvm: actually set the timeout on normal appserver, restore on canaries [puppet] - 10https://gerrit.wikimedia.org/r/215931 (https://phabricator.wikimedia.org/T98489) [16:39:40] (03CR) 10Jforrester: [C: 04-1] "For 2015-06-18, not beforehand." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215932 (https://phabricator.wikimedia.org/T92555) (owner: 10Jforrester) [16:40:37] andrewbogott: I am around now. Still need help ? [16:40:59] akosiaris: yes! [16:41:10] so, what's the issue ? 
[16:41:18] (03PS3) 10Giuseppe Lavagetto: hhvm: actually set the timeout on normal appserver, restore on canaries [puppet] - 10https://gerrit.wikimedia.org/r/215931 (https://phabricator.wikimedia.org/T98489) [16:41:21] akosiaris: on labcontrol1001, Puppet (debug): Failed to load library 'ldap' for feature 'ldap' [16:41:24] (03PS1) 10Jcrespo: Depool es1008 and es2008 (and its slaves) for CHANGE MASTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215933 [16:42:10] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: actually set the timeout on normal appserver, restore on canaries [puppet] - 10https://gerrit.wikimedia.org/r/215931 (https://phabricator.wikimedia.org/T98489) (owner: 10Giuseppe Lavagetto) [16:42:29] 6operations, 10hardware-requests: CODFW Search Servers - https://phabricator.wikimedia.org/T97049#1338016 (10RobH) 5Open>3stalled a:5RobH>3None [16:43:07] andrewbogott: where's that error logged ? [16:43:30] /var/log/puppet/puppet-master.log [16:43:48] I turned on debugging in /usr/share/puppet/rack/puppetmasterd/config.ru and directed it there [16:44:13] I’m about to restart apache, so it’ll be noisy for a moment [16:44:23] andrewbogott: please do [16:44:30] I made a small change [16:44:44] 6operations, 10hardware-requests: Replace rubidium with radon for authdns (allocate radon, deallocate rubidium) - https://phabricator.wikimedia.org/T101256#1338025 (10RobH) 5Open>3Resolved a:3RobH The wipe task has been set and the server is ready for spares (once wiped.) If the disk is bad, Chris will... [16:44:45] 6operations, 10ops-eqiad: rubidium - wipe and reclaim to spares - investigate hdd issue - https://phabricator.wikimedia.org/T101279#1334412 (10RobH) [16:44:50] akosiaris: I did a dist-upgrade [16:44:56] so either your change or my upgrade improved things [16:44:58] ldap seems happy now [16:45:01] what did you change? [16:45:05] andrewbogott: sudo aptitude install ruby-ldap [16:45:17] andrewbogott: please puppetize that :-) [16:45:25] hah! ok, will do. 
[16:45:29] it's labs only btw [16:45:33] production does not need it [16:45:45] 6operations, 10ops-codfw: prepare equipment list for eqdfw - https://phabricator.wikimedia.org/T91077#1338037 (10RobH) a:5Papaul>3RobH [16:45:49] but if it's not worth the casing, changes whatever [16:45:54] it won't hurt in production either [16:46:03] akosiaris: related: https://gerrit.wikimedia.org/r/#/c/215911/ [16:46:30] 6operations: order onsite tools for eqdfw/eqord - https://phabricator.wikimedia.org/T91095#1338038 (10RobH) [16:46:33] 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1338039 (10Joe) The setting is now applied everywhere. [16:48:41] (03PS1) 10Andrew Bogott: Include ruby-ldap on puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/215936 [16:49:21] (03CR) 10jenkins-bot: [V: 04-1] Include ruby-ldap on puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/215936 (owner: 10Andrew Bogott) [16:51:24] (03PS2) 10Andrew Bogott: Include ruby-ldap on puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/215936 [16:51:47] akosiaris: ok, the next issue is that it isn’t seeing classes in the private repo [16:51:52] (which in this case is the labs ‘private’ repo) [16:52:05] hm, possibly because it’s not there :) [16:57:22] (03CR) 10APerson: "Matanya, just wondering why I was listed as a reviewer on this change, as I don't have a whole lot of experience with embedded Ruby." [puppet] - 10https://gerrit.wikimedia.org/r/215785 (owner: 10Matanya) [16:57:54] (03CR) 10Matanya: "a mistake, sorry." 
[puppet] - 10https://gerrit.wikimedia.org/r/215785 (owner: 10Matanya) [16:58:27] matanya: aper [16:58:35] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [16:58:35] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 67532 bytes in 0.238 second response time [16:58:36] yes, tath [16:58:41] same for me :) [16:58:41] *that [17:02:34] (03PS4) 10Dzahn: ganglia's sha1 cert to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) (owner: 10RobH) [17:03:46] (03CR) 10Dzahn: [C: 032] ganglia's sha1 cert to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) (owner: 10RobH) [17:06:29] congrats to joe, giuseppe and bryan :-) [17:07:08] (03PS4) 10Dzahn: ganglia: use SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/215508 (https://phabricator.wikimedia.org/T100825) [17:08:08] (03CR) 10Dzahn: [C: 032] ganglia: use SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/215508 (https://phabricator.wikimedia.org/T100825) (owner: 10Dzahn) [17:10:02] 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1338091 (10Dzahn) a:3Dzahn [17:10:40] (03CR) 10Yuvipanda: ssh: Make hba enable-able via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [17:12:23] (03CR) 10Jcrespo: [C: 032] Depool es1008 and es2008 (and its slaves) for CHANGE MASTER [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215933 (owner: 10Jcrespo) [17:17:22] !log jynus Synchronized wmf-config/db-eqiad.php: Depool es1008 (duration: 00m 12s) [17:17:33] Logged the message, Master [17:18:23] !log jynus Synchronized wmf-config/db-codfw.php: Depool es2008 and its slaves (duration: 00m 13s) [17:18:28] Logged the message, Master [17:19:25] chasemp: _joe_ 
so the right thing to do is to not use the global (set by wikitech / LDAP) and just use the hiera bit [17:19:37] But the LDAP variable is already currently used by many instances [17:19:47] And it is string while the right thing is to use a bool [17:20:03] So that string LDAP variable is kept for backwards compatible only [17:20:21] honestly the interaction between hiera and ldap variables is opaque to me thus my poor reviewing [17:20:30] it may be better to wait and ask _joe_ if we can [17:20:34] So LDAP variables are just injected as globals [17:20:43] We already use this pattern in several places [17:20:55] !log Disabling Puppet and nutcracker on mw1017 to control for parser cache [17:21:07] Logged the message, Master [17:21:13] YuviPanda: sure no doubt but my guess is they predate hiera? [17:21:56] Yes [17:21:57] They do [17:22:02] Wait [17:22:04] I mean [17:22:17] We have several places where we have a canonical source from hiera [17:22:22] With fallback to an LDAP variable [17:22:26] Exactly like this [17:22:39] Grep for use_dnsmasq to see another example [17:23:31] 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) to the phabricator data dump - https://phabricator.wikimedia.org/T100846#1338122 (10JAufrecht) Is there any transaction history for story points? E.g., if a story is changed from 8 points on March 2 to 13 points on March 3, I'd... 
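[editor's note] The pattern described above (a canonical hiera setting, with the old LDAP-injected global kept only for backwards compatibility) might look like the following Python sketch. It is not the Puppet implementation: the function name `effective_hba`, the hiera key `enable_hba`, and the source labels are assumptions made for illustration, with `ssh_hba` taken from the discussion. Returning the deciding source alongside the value is one way to see which instances still depend on the legacy variable.

```python
from typing import Mapping, Tuple


def effective_hba(
    hiera: Mapping[str, object],
    ldap_globals: Mapping[str, str],
) -> Tuple[bool, str]:
    """Return (enabled, source) for a flag with a legacy fallback.

    hiera:        per-project or per-host settings (hypothetical key 'enable_hba').
    ldap_globals: variables injected as globals from LDAP (legacy 'ssh_hba').
    """
    # Canonical source: an explicit boolean true in hiera.
    if hiera.get("enable_hba") is True:
        return True, "hiera"
    # Backwards-compatible fallback: the legacy string must be exactly 'yes'.
    if ldap_globals.get("ssh_hba") == "yes":
        return True, "ldap-legacy"
    # Neither source enabled it.
    return False, "default"


if __name__ == "__main__":
    print(effective_hba({"enable_hba": True}, {}))
    print(effective_hba({}, {"ssh_hba": "yes"}))
    print(effective_hba({"enable_hba": False}, {"ssh_hba": "no"}))
```

Instances that report the "ldap-legacy" source in such a sketch would be the ones to migrate before the global could be dropped.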
[17:24:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [17:24:48] ^seems cause by the restarts, it is ok now [17:25:13] YuviPanda: afaik part of what _joe_ was asking would mean no class variable is needed [17:25:28] that may be the confusion between what he is asking and your thinking [17:25:49] <_joe_> YuviPanda: we're trying to get rid of that damn mechanism [17:25:57] <_joe_> (using ldap to set global variables) [17:26:04] PROBLEM - nutcracker port on mw1017 is CRITICAL: Connection refused [17:26:23] <_joe_> but hey, don't wanna be a blocker [17:26:25] PROBLEM - nutcracker process on mw1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (nutcracker), command name nutcracker [17:26:36] <_joe_> I know you'll fix things if I ask you to [17:26:53] It may be a case of common pattern now and fix enmass when we are ready [17:26:58] to wipe out the anitpattern [17:27:11] 6operations: order onsite tools for eqdfw/eqord - https://phabricator.wikimedia.org/T91095#1338127 (10RobH) [17:27:11] _joe_: chasemp I think that fix will come when we move off open stack manager [17:27:19] <_joe_> chasemp: actually, we've been silently removing that antipatter for a long time [17:27:23] Which should be sometime this year [17:27:50] If you feel strongly about it I can not fall back to LDAP but thats 30 mins of me clicking buttons [17:28:11] just curious but is that to remove teh global fall back from certain projects? [17:28:30] No its to find out all the instances that have them on [17:28:36] And decide how to deal with them [17:28:50] We have only per project and per host hiera setup for labs [17:29:26] (03PS6) 10Rush: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [17:30:01] (03CR) 10Alexandros Kosiaris: [C: 031] Include ruby-ldap on puppetmasters. 
[puppet] - 10https://gerrit.wikimedia.org/r/215936 (owner: 10Andrew Bogott) [17:30:28] (03CR) 10Rush: [C: 031] "functionally I think this works, I'm not sure on the best case pattern for labs variables and hiera. But this does accomplish the mission" [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [17:30:30] (03CR) 10Alexandros Kosiaris: [C: 031] Mark out some obsolete passenger settings on Trusty and Jessie. [puppet] - 10https://gerrit.wikimedia.org/r/215911 (owner: 10Andrew Bogott) [17:30:51] YuviPanda: functionally I think it's cool, the overall cleanup if intended probably deserves it's own task? [17:30:52] idk [17:30:57] Yes [17:30:58] It does [17:31:11] 'Get rid of all variables in ldap' [17:31:17] A weeklong thing I bet [17:31:39] chasemp and needs more hiera support for labs too [17:34:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:35:05] (03PS1) 10Andrew Bogott: Add version switching to the puppetmaster apache config. [puppet] - 10https://gerrit.wikimedia.org/r/215939 [17:35:44] (03PS2) 10Andrew Bogott: Mark out some obsolete passenger settings on Trusty and Jessie. [puppet] - 10https://gerrit.wikimedia.org/r/215911 [17:36:36] (03CR) 10Andrew Bogott: [C: 032] Mark out some obsolete passenger settings on Trusty and Jessie. [puppet] - 10https://gerrit.wikimedia.org/r/215911 (owner: 10Andrew Bogott) [17:36:55] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.006 second response time [17:36:55] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.006 second response time [17:37:20] (03PS3) 10Andrew Bogott: Include ruby-ldap on puppetmasters. 
[puppet] - 10https://gerrit.wikimedia.org/r/215936 [17:38:14] RECOVERY - nutcracker port on mw1017 is OK: TCP OK - 0.000 second response time on port 11212 [17:38:19] (03CR) 10Andrew Bogott: [C: 032] Include ruby-ldap on puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/215936 (owner: 10Andrew Bogott) [17:38:35] RECOVERY - nutcracker process on mw1017 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [17:39:29] (03PS2) 10Andrew Bogott: Add version switching to the puppetmaster apache config. [puppet] - 10https://gerrit.wikimedia.org/r/215939 [17:39:29] 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1338189 (10Dzahn) before: {F174573} {F174575} after: {F174577} {F174579} [17:41:29] (03CR) 10Andrew Bogott: [C: 032] Add version switching to the puppetmaster apache config. [puppet] - 10https://gerrit.wikimedia.org/r/215939 (owner: 10Andrew Bogott) [17:41:54] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.006 second response time [17:41:55] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.006 second response time [17:42:14] 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1338196 (10Dzahn) 5Open>3Resolved [17:42:15] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1338197 (10Dzahn) [17:42:47] (03PS1) 10Jcrespo: Repool es2008, es2009 and es2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215940 [17:43:00] (03PS1) 10Alexandros Kosiaris: Assign roles to etherpad1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/215941 [17:43:59] (03CR) 10Jcrespo: [C: 032] 
Repool es2008, es2009 and es2010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215940 (owner: 10Jcrespo) [17:44:50] !log jynus Synchronized wmf-config/db-codfw.php: Repool es2008 and its slaves (duration: 00m 13s) [17:44:56] Logged the message, Master [17:46:54] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.006 second response time [17:46:54] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.006 second response time [17:48:14] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [17:48:20] (03PS1) 10Chad: Phabricator: Enable the webserver to serve git repos [puppet] - 10https://gerrit.wikimedia.org/r/215942 [17:48:43] twentyafterfour: ^^ :) [17:48:44] WIP [17:48:55] PROBLEM - puppet last run on cp1070 is CRITICAL Puppet has 1 failures [17:49:31] (03PS2) 10Chad: Phabricator: Enable the webserver to serve git repos [puppet] - 10https://gerrit.wikimedia.org/r/215942 [17:49:47] akosiaris: now I’m getting clean puppet runs off of a labs Trusty puppetmaster. Are there things I should be wary about (e.g. https://phabricator.wikimedia.org/T98129) or can I declare victory? [17:49:54] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:23] andrewbogott: how many instances are using it ? [17:50:30] right now? two [17:50:37] andrewbogott: it will work until it doesn't [17:50:37] just test cases [17:50:43] fair enough :) [17:50:54] because some weird ERB shows up in a template [17:51:34] yeah, no doubt beta will break when it switches over. 
[17:51:44] andrewbogott: I 'd say declare the battle won, prepare for the war [17:51:54] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 214 bytes in 0.007 second response time [17:51:54] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.006 second response time [17:51:56] PROBLEM - puppet last run on mw1140 is CRITICAL Puppet has 1 failures [17:52:33] (03PS1) 10Jcrespo: Repool es1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215943 [17:52:45] (03CR) 1020after4: [C: 031] Phabricator: Enable the webserver to serve git repos [puppet] - 10https://gerrit.wikimedia.org/r/215942 (owner: 10Chad) [17:54:13] Can anyone give me a ballpark for recent UDP packet loss rates? Specifically, I'm trying to triage slightly low udplog request numbers. [17:56:39] (03CR) 10Jcrespo: [C: 032] Repool es1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215943 (owner: 10Jcrespo) [17:56:54] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 248 bytes in 0.013 second response time [17:56:54] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.014 second response time [17:57:31] (03PS3) 10coren: Tool Labs: add old-style fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) [17:57:53] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1008 (duration: 00m 15s) [17:57:59] Logged the message, Master [17:58:37] (03PS4) 10coren: Tool Labs: add old-style fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) [17:59:24] PROBLEM - puppet last run on elastic1020 is CRITICAL Puppet has 1 failures [17:59:25] PROBLEM - puppet last run on cp4009 is 
CRITICAL Puppet has 1 failures [18:02:25] PROBLEM - puppet last run on lvs3003 is CRITICAL Puppet has 1 failures [18:02:35] PROBLEM - puppet last run on mw1066 is CRITICAL Puppet has 2 failures [18:02:43] (03PS1) 10BBlack: no-op: remove explicit default retry5NN => 0 [puppet] - 10https://gerrit.wikimedia.org/r/215945 [18:02:45] (03PS1) 10BBlack: mobile: retry503x1 in fe, no other retries [puppet] - 10https://gerrit.wikimedia.org/r/215946 (https://phabricator.wikimedia.org/T97206) [18:02:55] RECOVERY - puppet last run on cp1070 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:02:58] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1338262 (10dr0ptp4kt) Yes, and Wes, please :) I do recommend adding the https://www.wikipedia.org/ and http://www.wikipedia.org/ domains for the two as well. [18:03:36] PROBLEM - puppet last run on mw2075 is CRITICAL Puppet has 1 failures [18:03:36] (03CR) 10BBlack: [C: 032] no-op: remove explicit default retry5NN => 0 [puppet] - 10https://gerrit.wikimedia.org/r/215945 (owner: 10BBlack) [18:03:45] PROBLEM - puppet last run on mw2128 is CRITICAL Puppet has 1 failures [18:03:55] (03CR) 10BBlack: [C: 032] mobile: retry503x1 in fe, no other retries [puppet] - 10https://gerrit.wikimedia.org/r/215946 (https://phabricator.wikimedia.org/T97206) (owner: 10BBlack) [18:04:15] RECOVERY - puppet last run on mw1140 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:05] 6operations: exclude /autoinstall/ from being cached on install-server - https://phabricator.wikimedia.org/T101419#1338266 (10fgiunchedi) 3NEW [18:07:10] Did everyone already notice how the error rate on wfLogDBError plummeted once _joe_ fixed the hhvm config patch? 
[18:07:27] look at the 6 hour view of https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError [18:07:36] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1338292 (10BBlack) The current state of affairs is now this updated table: | cluster/layer | retry5xx | retry503 | text fe | 0... [18:08:28] (03PS2) 10Filippo Giunchedi: install-server: create placeholder LV to work around partman-lvm bug [puppet] - 10https://gerrit.wikimedia.org/r/215806 (https://phabricator.wikimedia.org/T100636) [18:08:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: create placeholder LV to work around partman-lvm bug [puppet] - 10https://gerrit.wikimedia.org/r/215806 (https://phabricator.wikimedia.org/T100636) (owner: 10Filippo Giunchedi) [18:08:48] 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1338296 (10akosiaris) I 've started a full fleet catalog compilation process on catalogcompiler.eqiad.wmflabs using ruby 1.9, set via update-alternatives. I expect it... 
[18:08:54] PROBLEM - MySQL Replication Heartbeat on es1009 is CRITICAL: CRIT replication delay 364 seconds [18:09:33] ^my fault [18:09:52] (03PS3) 10Chad: Phabricator: Enable the webserver to serve git repos [puppet] - 10https://gerrit.wikimedia.org/r/215942 [18:11:02] 6operations, 10wikitech.wikimedia.org, 7HTTPS, 5Patch-For-Review: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1338304 (10Dzahn) a:3Dzahn [18:11:39] 6operations, 5Patch-For-Review: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1338305 (10fgiunchedi) 5Open>3Resolved tentatively resolved, I've tested the new recipes with jessie and trusty and didn't see regressions in trusty [18:11:51] 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1338307 (10bd808) According to logstash the error rate has gone from ~90/minute before the patch to ~2/minute after. 
[18:13:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [18:15:04] RECOVERY - puppet last run on elastic1020 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:15:45] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:55] RECOVERY - puppet last run on mw2128 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [18:16:15] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:16:24] RECOVERY - puppet last run on mw1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:16:26] (03PS1) 10Jcrespo: Master->Slave switchover of es1009 to es1008 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215949 [18:16:54] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:17:17] (03CR) 10Jcrespo: [C: 032] Master->Slave switchover of es1009 to es1008 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215949 (owner: 10Jcrespo) [18:18:52] (03PS1) 10BBlack: avoid cache effects of old clicktracking-session cookie [puppet] - 10https://gerrit.wikimedia.org/r/215951 [18:19:56] (03CR) 10BBlack: [C: 032] avoid cache effects of old clicktracking-session cookie [puppet] - 10https://gerrit.wikimedia.org/r/215951 (owner: 10BBlack) [18:20:06] here we go [18:20:35] !log jynus Synchronized wmf-config/db-eqiad.php: Depool es1009 and master-slave switchover (duration: 00m 13s) [18:20:41] Logged the message, Master [18:21:36] (03PS1) 10BBlack: Revert "avoid cache effects of old clicktracking-session cookie" [puppet] - 10https://gerrit.wikimedia.org/r/215952 [18:21:43] (03CR) 10BBlack: [C: 032 V: 032] Revert "avoid cache effects of old clicktracking-session cookie" [puppet] - 10https://gerrit.wikimedia.org/r/215952 (owner: 10BBlack) [18:22:17] 
we'll probably get a small chunk of puppetfail spam from my clicktrack change+revert above [18:22:20] sorry! [18:23:28] bblack, I am spaming more! I have to do every step up/down on separate step [18:24:18] I may have gotten lucky and reverted fast enough [18:25:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:26:35] (03PS5) 10Ottomata: Make it possible to install multiple custom diamond collectors that use the same source [puppet] - 10https://gerrit.wikimedia.org/r/215056 [18:26:47] andrewbogott: I am going to merge that ^ [18:26:53] hopefully i won't break labs puppet this time! [18:27:07] do you remember where you saw this break before? [18:27:31] ottomata: I think everywhere :) [18:27:46] any labs instance? [18:27:56] join -lab and shinken will let you know if things break [18:28:01] *-labs [18:28:04] (03CR) 10Ottomata: [C: 032] Make it possible to install multiple custom diamond collectors that use the same source [puppet] - 10https://gerrit.wikimedia.org/r/215056 (owner: 10Ottomata) [18:28:15] ok cool [18:28:44] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 19 hours old. [18:30:25] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [18:30:29] ottomata: so far looks ok [18:35:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:35:27] (03PS1) 10BBlack: avoid cache effects of old clicktracking-session cookie, round 2 [puppet] - 10https://gerrit.wikimedia.org/r/215955 [18:37:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [18:40:00] andrewbogott: we could do the wikitech cert change now [18:40:09] andrewbogott: just the cert, no config change [18:40:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:40:31] mutante: ok. I don’t have to do anything except look on nervously, right? 
[18:40:35] andrewbogott: right [18:40:48] then let’s do it :) [18:41:00] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338409 (10RobH) So for the record, I hopped on librenms just to check out who had accounts already. Joel: You already have an account: jkrauska. So these users were imported over from observium, so... [18:41:15] ok, i'll keep a copy of the old cert in /root and let puppet recreate stuff [18:41:35] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338413 (10RobH) I should have made it very clear: try to use your old observium login if you have it =] [18:41:42] cajoel: ^ [18:42:34] andrewbogott: oooh, caught the issue just in time before merge [18:42:43] additional whitespace in cert, would have broken it [18:42:44] ? [18:42:44] fixing [18:42:47] ok [18:43:32] " -----BEGIN CERTIFICATE" that [18:43:45] (03CR) 10BBlack: [C: 032] avoid cache effects of old clicktracking-session cookie, round 2 [puppet] - 10https://gerrit.wikimedia.org/r/215955 (owner: 10BBlack) [18:44:26] (03PS7) 10Dzahn: certs: wikitech.wm.org certificate SHA1 to SHA2 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [18:45:15] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:45:23] (03CR) 10Dzahn: "PS7: fixed leading whitespace that would have broken it like on the icinga cert" [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [18:45:32] (03PS8) 10Dzahn: certs: wikitech.wm.org certificate SHA1 to SHA2 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [18:46:31] (03CR) 10Dzahn: [C: 032] certs: wikitech.wm.org certificate SHA1 to SHA2 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [18:48:10] !log 
restarted apache on silver/wikitech [18:48:14] Logged the message, Master [18:49:42] andrewbogott: done, still alive and now cert is not signed with SHA1 anymore [18:49:49] cool, thanks! [18:50:07] andrewbogott: grade A https://www.ssllabs.com/ssltest/analyze.html?d=wikitech.wikimedia.org [18:50:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 11 failures [18:50:32] this will also solve the warnings Chrome users got [18:51:45] Krinkle: ^ wikitech warnings about "outdated security" should be gone [18:53:05] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 19 hours old. [18:53:41] 6operations, 10wikitech.wikimedia.org, 7HTTPS, 5Patch-For-Review: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1338433 (10Dzahn) amended to the change (we had another leading whitespace that would break it, fixed that), ran pupp... [18:54:01] 6operations, 10ops-codfw: prepare equipment list for eqord - https://phabricator.wikimedia.org/T91079#1338434 (10RobH) 5Open>3Resolved I'm rolling this into T91077. 
[18:54:12] an easy way to triple-check those certs on your local machine when preparing the commit: "openssl x509 -in files/ssl/foo.crt -text" [18:54:18] 6operations, 10ops-codfw: prepare equipment list for eqord - https://phabricator.wikimedia.org/T91079#1338440 (10RobH) [18:54:20] 6operations, 10ops-codfw: prepare equipment list for eqdfw - https://phabricator.wikimedia.org/T91077#1073784 (10RobH) [18:54:22] 6operations: order onsite tools for eqdfw/eqord - https://phabricator.wikimedia.org/T91095#1338438 (10RobH) 5Open>3Resolved Rolling this into T91077 [18:54:24] it will catch silly errors like the space thing and barf [18:55:15] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 11 failures [18:56:25] 6operations, 10wikitech.wikimedia.org, 7HTTPS, 5Patch-For-Review: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1338449 (10Dzahn) a:5Dzahn>3RobH re-assigned to RobH >>! In T92709#1321521, @RobH wrote: > once the above patchs... [18:58:14] 6operations, 7database: es[12]00[123] maintenance and upgrade - https://phabricator.wikimedia.org/T101084#1338454 (10jcrespo) Switchover completed. No relevant errors on kibana, there are some errors in 1008 error log about 1009 disconnecting, but before the fail over (probably caused by the temporary 10 -> 5.... [18:58:15] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. 
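Note on the certificate check discussed above: the leading-whitespace problem caught before merge (`" -----BEGIN CERTIFICATE"`) can also be flagged with a small pre-commit check. The sketch below is a hypothetical helper, not part of the puppet repository, and it only looks at whitespace around the PEM markers. It does not replace running `openssl x509 -in files/ssl/foo.crt -text`, which actually parses the certificate.

```python
PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def pem_whitespace_problems(text):
    """Return a list of human-readable problems found in PEM text.

    Hypothetical helper: flags BEGIN/END marker lines that carry leading
    or trailing whitespace, and files with no BEGIN marker at all.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # A marker line must consist of the marker and nothing else.
        if stripped in (PEM_BEGIN, PEM_END) and line != stripped:
            problems.append(f"line {number}: whitespace around {stripped!r}")
    if PEM_BEGIN not in text:
        problems.append("no BEGIN CERTIFICATE marker found")
    return problems
```

Calling `pem_whitespace_problems(open(path).read())` on each changed `.crt` file and refusing the commit when the list is non-empty would be one way to use it; the file path and the hook wiring are left as assumptions.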
[19:00:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 11 failures [19:00:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:05:06] (03PS4) 10Dzahn: mysql: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211358 [19:05:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 11 failures [19:05:45] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1338495 (10RobH) [19:10:04] (03CR) 10Dzahn: [C: 032] mysql: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211358 (owner: 10Dzahn) [19:10:14] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:10:51] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1338514 (10Ottomata) Oo, ok, another question. Do we want client / backend metrics for a given varnish instance reported? Or just client? If I was doing just client, i would pass -c to varnishlog api... 
[19:11:36] (03Abandoned) 10Dzahn: retab redirects.dat [puppet] - 10https://gerrit.wikimedia.org/r/214421 (owner: 10Dzahn) [19:11:37] (03CR) 10Dzahn: "wish i could run it in puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/211356 (owner: 10Dzahn) [19:11:57] (03PS4) 10Dzahn: labs_lvm: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211346 [19:11:59] (03PS2) 10Dzahn: varnish: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211352 [19:12:03] (03PS5) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [19:15:15] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 97 seconds ago with 0 failures [19:26:52] 6operations, 7HTTPS, 5Patch-For-Review: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1338566 (10Dzahn) a:3Dzahn [19:30:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [19:30:54] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [19:34:56] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1338589 (10RobH) p:5High>3Normal [19:40:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:46:26] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:52:54] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338653 (10faidon) I'm okay with this but since I'm guessing the timing with NANOG wasn't accidental: please treat data over there as confidential — in particular, do not share traffic figures with ve... 
[19:53:31] 6operations: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1338654 (10Dzahn) 3NEW [19:53:35] (03CR) 10Matanya: "minor nitpicks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/211352 (owner: 10Dzahn) [19:53:46] 6operations: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1338661 (10Dzahn) a:3Dzahn [19:54:53] 6operations: move RT behind misc-web - https://phabricator.wikimedia.org/T101432#1338654 (10Dzahn) p:5Triage>3Normal [19:55:16] mutante: do you want comment on missing lint fixes, or just on what you changed ? [19:56:33] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338668 (10chasemp) a:3RobH tossing your way Robh since it sounds like a simple deal for local folk and you have a handle on it [19:56:54] matanya: hmm.. just on what i changed because it's hard to draw a line and i tried to keep them smaller by focusing on a few types of warnings at a time [19:57:04] ok [19:58:00] in this case "indentation of =>" and "double quoted string" [19:58:23] if one class of warning can be completely removed we can re-enable it in jenkins [19:59:24] (03CR) 10Matanya: [C: 031] openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211356 (owner: 10Dzahn) [19:59:39] matanya: thank you for checking these [20:00:39] mutante: honestly, it is a frightening one [20:00:43] matanya: that's why it's not merged yet [20:01:20] i gave up on a huge one that fixed it globally :p [20:01:41] fair enough [20:02:33] (03PS1) 10BBlack: delete ancient clicktracking-session cookie from browsers [puppet] - 10https://gerrit.wikimedia.org/r/215962 [20:03:14] matanya: missing the compiler.. or i would do that [20:03:29] yeah, agree with your comment there [20:04:02] (03CR) 10Ori.livneh: openstack: lint fixes (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/211356 (owner: 10Dzahn) [20:04:31] bblack: why don't we do it in JS? 
[20:05:26] just seemed easier (to me) to do it here and know that I'm hitting every possible domain/wiki, and to remember removing it and the regex exception together once they mostly-vanish from logs. [20:05:30] eh yea, that's the same thing. thanks, but it was intentionally only fixing certain types of warnings [20:05:43] ori: ^ [20:06:47] bblack: makes sense. how would we know to remove it? [20:07:03] by me remembering to check up on varnishlog sampling manually looking for it [20:07:04] maybe in a week we could add a syslog call? [20:07:06] right [20:07:13] that's better [20:07:17] (03CR) 10Ori.livneh: [C: 031] delete ancient clicktracking-session cookie from browsers [puppet] - 10https://gerrit.wikimedia.org/r/215962 (owner: 10BBlack) [20:07:50] in general I'm still due to come back around and do a bunch more VCL refactoring soon and try to kill cruft. So even if I forget, I'll find it again myself. [20:08:12] yeah [20:08:53] apropos of nothing: while poking around with varnishapi, i saw that varnish logs VCL_Debug records on vcl subroutine entry [20:09:02] could be useful one day for debugging some obtuse VCL issue [20:09:25] nice [20:14:50] (03PS2) 10Dzahn: replace librenms's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH) [20:15:24] (03PS3) 10Dzahn: replace librenms's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH) [20:16:05] \o/ [20:16:14] (03CR) 10BBlack: [C: 032] delete ancient clicktracking-session cookie from browsers [puppet] - 10https://gerrit.wikimedia.org/r/215962 (owner: 10BBlack) [20:17:05] (03PS4) 10Dzahn: replace librenms's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH) [20:17:59] (03CR) 10Dzahn: [C: 032] replace librenms's sha1 cert with sha256 [puppet] - 
10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH) [20:19:34] 6operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1338711 (10ArielGlenn) for non grain targets, i.e. * matching, don't use the timeout option. use the batch option. this is what I said on irc. and on this ticket. salt -b 100 --out=raw '*' cmd.run 'lsb_release -c -s' | tee... [20:20:04] ... waiting Warning: DocumentRoot [/srv/nonexistent] does not exist [20:20:07] lol? [20:20:44] 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) to the phabricator data dump - https://phabricator.wikimedia.org/T100846#1338714 (10chasemp) 5Open>3Resolved >>! In T100846#1338122, @JAufrecht wrote: > Is there any transaction history for story points? E.g., if a story is... [20:21:40] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1338718 (10BBlack) Basically we need something like this for the websockets through nginx: http://nginx.org/en/docs/http/websocket.html [20:22:23] (03PS5) 10coren: Tool Labs: add old-style fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) [20:22:37] (03CR) 10coren: [C: 032] " Coren: consider it a virtual +1? Am out :)" [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) (owner: 10coren) [20:24:19] Is there a way to force a GIF thumbnail to be rendered? [20:24:23] 6operations, 7HTTPS, 5Patch-For-Review: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1338721 (10Dzahn) replaced cert without making an additional config change signature algorithm is now SHA256withRSA grade A- https://www.ssllabs.com/ssltest/analyze.html?d=librenm... 
[20:24:46] 6operations, 7HTTPS, 5Patch-For-Review: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1338722 (10Dzahn) 5Open>3Resolved [20:24:48] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1338723 (10Dzahn) [20:27:28] Krenair: where does modules/admin/files/GenSysadminTable.py suppose to run ? [20:27:58] where is it supposed to run? [20:29:50] it runs on your laptop ? [20:29:50] yep [20:29:50] would mind automating ? [20:29:51] you want me to write a bot that updates the correct part of the page? :/ [20:30:14] that would be one option [20:30:54] though i thought of suggesting to do it [20:33:41] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1338775 (10Dzahn) could this ticket be split into more specific ToDos? No PFS, weak key-options and the MAC of the certificate still use SHA1 and config DNSSEC for that domain too (and if... [20:34:11] (03PS24) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [20:35:37] modules/mediawiki/files/apache/sites/secure.wikimedia.conf: RewriteRule ^/otrs/(.*)$ https://ticket.wikimedia.org/otrs/$1 [R=301,L] [20:35:42] secure.conf ? really? [20:35:49] matanya, although I think that script is broken at the moment, because it assumes all users have ensure: present, but niedzielski does not [20:36:09] probably legacy from secure.wm.o? [20:36:18] Krenair: yes, indeed [20:36:39] let's fix that about the missing ensure => for one user [20:36:46] even though it happens to work [20:37:10] it's just a thing with my script, not necessarily the data.. 
[20:37:38] it's still inconsistent in the data.yaml [20:37:44] seems ensure defaults to 'present' [20:37:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [20:38:07] which seems perfectly reasonable, so.. [20:40:27] (03PS1) 10Dzahn: admin: add ensure => for user niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/215966 [20:41:45] (03PS2) 10Dzahn: admin: add ensure => for user niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/215966 [20:42:15] (03PS3) 10Dzahn: admin: add ensure => for user niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/215966 [20:42:48] (03CR) 10Dzahn: "is this still interesting now that DNS changes have happened in labs?" [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [20:43:13] (03CR) 10Dzahn: [C: 032] admin: add ensure => for user niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/215966 (owner: 10Dzahn) [20:44:14] (03CR) 10Paladox: [C: 031] Phabricator: Enable the webserver to serve git repos [puppet] - 10https://gerrit.wikimedia.org/r/215942 (owner: 10Chad) [20:44:44] PROBLEM - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:46:14] <_joe_> ori: that you ^^ ? [20:46:25] RECOVERY - DPKG on mw1017 is OK: All packages OK [20:47:04] 6operations, 7HTTPS, 5Patch-For-Review: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1338860 (10Dzahn) [20:47:28] _joe_: yes, godog and i are on it [20:47:35] The following packages have unmet dependencies: [20:47:35] hhvm-fss : Depends: hhvm-api- which is a virtual package. 
[20:47:38] 6operations, 7HTTPS, 5Patch-For-Review: replace tendril.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100835#1338862 (10Dzahn) [20:47:58] 6operations, 7HTTPS, 5Patch-For-Review: replace icinga's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100830#1338864 (10Dzahn) [20:47:58] <_joe_> ori: yeah you need to update the build deps of hhvm-fss [20:48:10] <_joe_> ori: add "hhvm" in the build deps :) [20:48:13] <_joe_> godog: ^^ [20:48:24] 6operations, 7HTTPS, 5Patch-For-Review: replace git's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100827#1338866 (10Dzahn) [20:48:27] <_joe_> I forgot to fix that :( [20:48:41] 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1338868 (10Dzahn) [20:49:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:51:02] (03PS6) 10coren: Tool Labs: add old-style fqdn aliases to nodes [puppet] - 10https://gerrit.wikimedia.org/r/215918 (https://phabricator.wikimedia.org/T101296) [20:52:38] _joe_: oh ok, will do that [20:53:09] (03CR) 10Ori.livneh: [C: 031] Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:55:02] ori: i was about to push a patch on that :) [20:55:03] some changes. [20:55:25] (03CR) 10Ori.livneh: [C: 04-1] "needs some changes!" [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [20:55:30] better? 
:) [20:55:45] hehe [20:55:59] <_joe_> lol [20:57:58] _joe_: btw in the latest hhvm-dev package I think sth got mixed up, it should Depends: on a bunch of packages, the change is in git but not in the uploaded package [20:58:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [20:58:51] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1338890 (10Dzahn) @RobH unlike all other replaced certs i don't see this one in Gerrit yet. Could you get that too? [20:59:51] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1338899 (10Dzahn) a:3RobH [21:00:49] <_joe_> godog: yeah it will be tomorrow :P [21:00:58] (03PS25) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [21:01:16] <_joe_> I'm uploading a package tomorrow, it will be based off of the current master branch [21:03:11] (03CR) 10Ottomata: "Please review varnishreqstats-diamond.py." [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [21:04:15] (03PS26) 10Ottomata: Add varnish request stats diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) [21:06:17] 6operations, 7HTTPS: The certificate chains of newly installed SHA256 certificates are incomplete. 
- https://phabricator.wikimedia.org/T88507#1338917 (10Dzahn) [21:08:16] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1338919 (10Dzahn) please revoke the SHA-1 versions of: tendril, librenms, icinga, ganglia please check if the SHA-1 certs have been revoked already: dumps, blog, gerrit [21:10:44] godog: could also do this: /bin/grep -Po '(?<=#define HHVM_API_VERSION )\d+' /usr/include/hphp/runtime/ext/extension.h [21:10:50] that way you don't need to depend on hhvm [21:11:03] but i guess hhvm --version is less likely to break than grepping a particular file [21:11:44] PROBLEM - puppet last run on mw2030 is CRITICAL Puppet has 1 failures [21:11:57] (03PS1) 10Mjbmr: Add dns for lrc sites [dns] - 10https://gerrit.wikimedia.org/r/215970 [21:12:58] ori: heh I think hhvm-dev should generate that at build time and ship it somewhere known, so extensions can read it [21:13:33] (03PS2) 10Mjbmr: Add dns for lrc sites [dns] - 10https://gerrit.wikimedia.org/r/215970 [21:13:39] makes sense [21:14:57] (03PS1) 10Dzahn: RT: adjust Apache config to be behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/215972 (https://phabricator.wikimedia.org/T101432) [21:17:52] (03PS1) 10Dzahn: varnish: add RT on magnesium to misc-web config [puppet] - 10https://gerrit.wikimedia.org/r/215973 (https://phabricator.wikimedia.org/T101432) [21:20:58] (03PS1) 10Dzahn: RT: move behind misc-web [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) [21:21:53] (03CR) 10Dzahn: "not using a CNAME here because we also have MX records. compare to phabricator which is behind misc-web but also has MX records" [dns] - 10https://gerrit.wikimedia.org/r/215974 (https://phabricator.wikimedia.org/T101432) (owner: 10Dzahn) [21:24:31] (03PS1) 10Andrew Bogott: Distinguish between glance and keystone IP. 
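Note on the `grep -Po` suggestion above: the same lookbehind pattern can be tried out in Python against header text, which may help when testing the extraction without an HHVM install. This is only a sketch; the function name is hypothetical, and it assumes the define sits on one line in exactly the form the grep pattern expects. As noted in the channel, asking `hhvm --version` is less likely to break than grepping a particular file.

```python
import re

# Same pattern as the grep -Po call quoted in the channel: match the digits
# that directly follow "#define HHVM_API_VERSION ".
API_VERSION_RE = re.compile(r"(?<=#define HHVM_API_VERSION )\d+")


def hhvm_api_version(header_text):
    """Return the API version found in header_text as an int, or None."""
    match = API_VERSION_RE.search(header_text)
    return int(match.group(0)) if match else None
```

A caller would pass in the contents of `/usr/include/hphp/runtime/ext/extension.h` (the path quoted above) and treat `None` as "header layout changed, fall back to another method".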
[puppet] - 10https://gerrit.wikimedia.org/r/215975 [21:25:15] (03CR) 10jenkins-bot: [V: 04-1] Distinguish between glance and keystone IP. [puppet] - 10https://gerrit.wikimedia.org/r/215975 (owner: 10Andrew Bogott) [21:25:50] (03PS2) 10Andrew Bogott: Distinguish between glance and keystone IP. [puppet] - 10https://gerrit.wikimedia.org/r/215975 [21:27:00] (03CR) 10Andrew Bogott: [C: 032] Distinguish between glance and keystone IP. [puppet] - 10https://gerrit.wikimedia.org/r/215975 (owner: 10Andrew Bogott) [21:27:52] !log restarted logstash and elasticsearch on logstash100[1-3] to pick up latest jre updates [21:27:56] Logged the message, Master [21:29:10] (03PS1) 10Andrew Bogott: s/controller/host (type) [puppet] - 10https://gerrit.wikimedia.org/r/215976 [21:29:15] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:29] update all the things! [21:29:38] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338974 (10JKrauska) @faidon of course. I have no intention of speaking on behalf of production or Ops and never present myself as such. [21:30:02] (03CR) 10Andrew Bogott: [C: 032] s/controller/host (type) [puppet] - 10https://gerrit.wikimedia.org/r/215976 (owner: 10Andrew Bogott) [21:30:17] (03CR) 10Dzahn: "at a glance i already see 3 tickets where this has been closed as invalid/rejected or duplicate" [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:31:37] (03CR) 10Mjbmr: "Wait till Monday please." [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:32:15] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1338983 (10JKrauska) @faidon and to be further clear, I will treat the data as internally confidential. Although I disagree with that policy from a 'free open knowledge' perspective. 
[21:32:57] (03CR) 10Dzahn: [C: 04-1] "can we have a "Verified" on https://meta.wikimedia.org/wiki/Requests_for_new_languages please?" [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:33:34] (03CR) 10Dzahn: "ah, saw your last comment after i submitted mine. ok, thanks, will wait" [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:34:43] (03CR) 10Mjbmr: "See: https://lists.wikimedia.org/pipermail/langcom/2015-June/000378.html" [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:34:55] !log performing rolling restart of HHVMs for hhvm-fss upgrade [21:35:00] Logged the message, Master [21:35:12] (03PS3) 10Dzahn: Add language "lrc" (Northern Luri) [dns] - 10https://gerrit.wikimedia.org/r/215970 (owner: 10Mjbmr) [21:35:28] 10Ops-Access-Requests, 6operations: Request for access to analytics cluster for hive queries (AndyRussG) - https://phabricator.wikimedia.org/T101443#1338985 (10AndyRussG) 3NEW [21:36:13] 10Ops-Access-Requests, 6operations: Request for access to analytics cluster for hive queries (AndyRussG) - https://phabricator.wikimedia.org/T101443#1339003 (10AndyRussG) [21:38:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [21:39:02] that's the restart [21:39:05] it's subsiding [21:43:11] !log ori Synchronized php-1.26wmf8/includes/libs/ReplacementArray.php: 1b20d62c26: Revert "awful hack: disable fss on zhwiki only, except on mw1017" (duration: 00m 13s) [21:43:11] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:11] Logged the message, Master [21:43:11] hmm analytics1013 [21:43:33] no open tasks for it, but it crashed once before in april [21:43:37] * jgage jumps on console [21:44:48] no response, rebooting.. [21:45:31] sad server. 
[21:48:21] !log analytics1013 crashed, rebooted [21:48:25] Logged the message, Master [21:48:54] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms [21:49:06] oh [21:49:07] uh oh [21:49:12] jgage was it that temp thing? [21:49:16] temperature [21:49:46] there are DRAM errors in syslog [21:50:41] socket 0, i wonder if that's closest to the cpu [21:50:47] which could be a thermal thing [21:51:10] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1339079 (10DCDuring) If this is to be an application open to ordinary Wiktionary project contributors, then the criterion for closure has to be something like "what failed r... [21:51:48] Jun 4 21:38:11 analytics1013 mcelog: Processor 9 heated above trip temperature. Throttling enabled. [21:51:48] Jun 4 21:38:11 analytics1013 mcelog: Please check your system cooling. Performance will be impacted [21:52:31] looks related [21:52:45] that's right before [21:52:45] Jun 4 21:38:12 analytics1013 kernel: [3268161.338897] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1995ae offset:0x200 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:1) [21:55:25] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1339105 (10Gage) 5Resolved>3Open This machine crashed again. All the errors are on socket 0, so we should probably replace that DIMM. Furthermore I'd like to know if that socket is the one closest to... [21:55:34] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk, 3Fundraising Sprint Lou Reed, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1339107 (10AndyRussG) > In those cases, there are more requests in kafkatee than in udp2... 
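Note on the EDAC line pasted above: when checking whether all memory errors land on the same socket and DIMM, it can help to pull the location fields out of each kernel message. The sketch below is a hypothetical parser written against the single line quoted in the channel. Other kernels or memory controllers may format the message differently, so the pattern is an assumption and not a general EDAC parser.

```python
import re

# Pattern modelled on the one kernel line quoted in the log, e.g.
# "EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (...)".
EDAC_RE = re.compile(
    r"EDAC (?P<controller>MC\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on "
    r"CPU_SrcID#(?P<socket>\d+)_Channel#(?P<channel>\d+)_DIMM#(?P<dimm>\d+)"
)


def parse_edac(line):
    """Return the error location as a dict, or None if the line does not match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    # Convert purely numeric fields to int; keep labels such as "MC0" and "CE".
    return {
        key: int(value) if value.isdigit() else value
        for key, value in match.groupdict().items()
    }
```

Feeding every `EDAC` line from syslog through `parse_edac` and counting the `(socket, channel, dimm)` tuples would show whether the corrected errors (`CE`) cluster on one DIMM, which is the question raised in T97380.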
[21:55:39] 10Ops-Access-Requests, 6operations: Request for access to analytics cluster for hive queries (AndyRussG) - https://phabricator.wikimedia.org/T101443#1339109 (10chasemp) p:5Triage>3Normal See: https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access Please provide your p... [21:56:18] ottomata: the last time i looked at that thermal overload error i saw it every day on several analytics hosts [21:56:27] it's in 1013's logs from yesterday, too [21:56:27] Jun 3 06:27:06 analytics1013 kernel: [3127099.228500] CPU9: Package temperature above threshold, cpu clock throttled (total events = 261964075) [21:56:42] hm, aye [21:56:43] hm. [21:56:43] ok [21:56:54] only on the older dells, right? [21:56:58] an11-an20? [21:57:19] hmm i don't recall [21:59:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:00:57] (03CR) 10Yuvipanda: "T101447 for getting rid of the LDAP globals in general." [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [22:01:17] _joe_: ^ if I can convince you to take off your -1 :) [22:02:14] <_joe_> YuviPanda consider it done [22:02:32] _joe_: :D ok! [22:03:00] hi all, i'm trying to login at wikitech but I keep getting the incorrect password entered message. I even reset my password, but I can't login. Can anyone help? [22:03:15] my username is bmansurov [22:03:23] (03CR) 10Giuseppe Lavagetto: "since we now have a phab ticket, I won't block this." [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [22:03:31] _joe_: thanks [22:03:41] all I need a phab ticket to get things done? 
huh [22:03:53] (03PS7) 10Yuvipanda: ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) [22:04:00] <_joe_> greg-g: nope, you also need to be yuvi :P [22:04:00] (03CR) 10Yuvipanda: [C: 032 V: 032] ssh: Make hba enable-able via hiera [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [22:04:29] <_joe_> he /promised/ he'll fix it :) [22:05:09] nm, i got in this time [22:05:12] works for me bmansurov [22:05:26] that's weird, after the second reset it worked [22:06:57] (03PS1) 10Ottomata: Turn on Kafka auto topic creation [puppet] - 10https://gerrit.wikimedia.org/r/215986 [22:07:25] (03PS2) 10Ottomata: Turn on Kafka auto topic creation [puppet] - 10https://gerrit.wikimedia.org/r/215986 [22:08:12] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1339207 (10RobH) Chatted with Joel already, so indeed, I'll handle this with him @ the office on Friday. [22:09:06] _joe_: I can promise things too! 
[22:09:35] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail [22:09:53] (03CR) 10Ottomata: [C: 032] Turn on Kafka auto topic creation [puppet] - 10https://gerrit.wikimedia.org/r/215986 (owner: 10Ottomata) [22:10:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [22:21:48] !log doing controlled restart of kafka brokers services to apply auto create topic config [22:21:52] Logged the message, Master [22:24:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:28:55] RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:30:05] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 29.0 [22:31:25] (03PS1) 10Faidon Liambotis: base: kill annoying mpt-status emails [puppet] - 10https://gerrit.wikimedia.org/r/215994 [22:32:17] (03CR) 10Faidon Liambotis: [C: 032] base: kill annoying mpt-status emails [puppet] - 10https://gerrit.wikimedia.org/r/215994 (owner: 10Faidon Liambotis) [22:33:54] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:42:43] (03PS6) 10Ori.livneh: Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/49678 (https://phabricator.wikimedia.org/T98563) (owner: 10Ottomata) [22:45:30] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1339385 (10ArielGlenn) It's not a prediction, it's based on observation after running the problematic step. I ran a full set of stub dumps in the middle of the month alread... 
[22:45:53] (03PS7) 10Ori.livneh: Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/49678 (https://phabricator.wikimedia.org/T98563) (owner: 10Ottomata) [22:45:54] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 18.0 [22:46:18] 6operations, 10ContentTranslation-cxserver, 6Labs, 6Services: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339388 (10yuvipanda) I salted the sed on most machines earlier, and they all seem ok now. [22:46:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/49678 (https://phabricator.wikimedia.org/T98563) (owner: 10Ottomata) [22:49:35] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:50:24] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1339399 (10DCDuring) I assumed you had a sound basis for your prediction, but the prediction isn't the fact. Do you think you could humor me? [22:53:13] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1339416 (10jberkel) thanks for your work on this ariel! to be fair, the title of the ticket is "snaphot1004 running dumps very slowly", not "wiktionary/(insert other project... [22:54:05] 6operations, 10ContentTranslation-cxserver, 6Labs, 6Services: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339418 (10faidon) FWIW, I ran a sed yestrerday via salt across the Labs fleet (in fact, the same salt command as you did). I don't know why... 
[22:55:15] PROBLEM - Kafka Broker Messages In on analytics1022 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [22:57:44] 6operations, 10ContentTranslation-cxserver, 6Labs, 6Services: LDAP TLS failing on some instances due to inconsistent state - https://phabricator.wikimedia.org/T101377#1339425 (10yuvipanda) Salt issues possibly, also puppet might've set them back if it was in an inconsistent state? [22:58:55] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1558.52299078 [23:00:04] RoanKattouw, ^d, rmoen, James_F, kaldari: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150604T2300). Please do the needful. [23:06:34] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1339479 (10Qgil) Is or could be {T823} a blocker for this task? [23:07:35] Who's doing SWAT? [23:09:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [23:09:55] (03PS1) 10Ori.livneh: rsync_slow_parse: run at 23h15m [puppet] - 10https://gerrit.wikimedia.org/r/216002 [23:10:21] (03CR) 10Ori.livneh: [C: 032 V: 032] rsync_slow_parse: run at 23h15m [puppet] - 10https://gerrit.wikimedia.org/r/216002 (owner: 10Ori.livneh) [23:11:01] James_F, RoanKattouw, ^d: I guess I'll do SWAT again if no one else is available [23:11:13] kaldari: That'd be great. [23:11:22] and the spate of fatals is again unrelated [23:11:29] 2015-06-04 23:10:37 mw1105 metawiki fatal INFO: [18501505] /w/index.php?title=Special:Book&bookcmd=book_creator&referer=Category:2006/Translations ErrorException from line 361 of /srv/mediawiki/php-1.26wmf8/vendor/oojs/oojs-ui/php/Tag.php: PHP Error: exception 'OOUI\Exception' with message 'Potentially unsafe 'href' attribute value. 
Scheme: ''; value: '/wiki/Category:2006/Translations'.' in /srv/mediawiki/php-1.26wmf8/vendor/o [23:11:29] ojs/oojs-ui/php/Tag.php:317 [23:11:37] what is this and why hasn't it been fixed yet [23:11:40] i have seen it for days [23:11:49] ori: Because wmf10 doesn't come out for a while. [23:11:57] ori: No fatals allowed during SWAT! ;) [23:12:07] ori: Fix is merged and waiting library release, which is waiting for wmf9 branch. [23:12:08] no SWAT allowed during fatals [23:12:16] can it be cherry-picked? [23:12:18] fatals in prod aren't cool [23:12:34] ori: Not trivially. [23:12:37] * James_F ponders. [23:12:43] well *a* fix [23:12:46] MatmaRex: What do you think? [23:13:13] ori: A fix is to not have the Collection extension pass valid-but-borking URLs to OOUI. [23:13:29] (Short-term.) [23:13:44] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 36.0 [23:13:50] what about '/wiki/Category:2006/Translations' is borking? [23:13:59] ':'? [23:14:40] ori: yes [23:14:43] mostly [23:14:52] James_F: I'm doing the mobile update first, then your change [23:14:53] can you hack around it? [23:14:55] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 18.0 [23:14:58] kaldari: Kk. [23:15:11] ori: kind of, i did do it in the merged patch that is awaiting release. [23:15:26] ori: backporting this would be a pain, i think, since it's in a composer-managed library [23:15:45] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 27.0 [23:15:52] MatmaRex: We could fake it in mediawiki/vendor. [23:15:54] or rather: with mediawiki/vendor we could do it trivially, if we ignore possibly failing sanity checks [23:16:00] i have seen it for days [23:16:03] that's been there for months.. 
[23:16:07] Indeed. [23:16:08] and if we don't mind inconsistent versions of stuff [23:16:15] (ssshhh ah i tried to schedule downtime for those kafka alerts in icinga but just saw that I wasn't logged in so was unauthorized) [23:16:19] [fluorine:/a/mw-log] $ grep -c OOUI fatal.log [23:16:19] 10074 [23:16:22] how is that cool? [23:16:45] [fluorine:/a/mw-log] $ grep -c 'Potentially unsafe' fatal.log [23:16:45] 657 [23:16:55] kaldari: please stop [23:16:56] anyway, i'm not ops, i'm not a deployer, you guys feel free to do whatever [23:16:58] ori: Hmm. What are all the others about? [23:17:10] just don't forget to backport the followup patch too [23:17:24] James_F: probably the same but matching multiple lines in individual traces [23:17:59] ori: https://phabricator.wikimedia.org/maniphest/query/Uw.4JDRhBSas/#R returns only this. [23:18:00] [fluorine:/a/mw-log] $ zgrep -c 'Potentially unsafe' archive/fatal.log-201506* [23:18:00] archive/fatal.log-20150601.gz:849 [23:18:00] archive/fatal.log-20150602.gz:1308 [23:18:02] ori: sure, just let me know when the coast is clear [23:18:02] archive/fatal.log-20150603.gz:1247 [23:18:04] archive/fatal.log-20150604.gz:1230 [23:18:09] seriously, guys, wtf [23:18:22] ori: i think it's fairly clear to everyone that it's been there for a while [23:18:25] Sorry, I'm here now to take over SWAT [23:18:29] ( kaldari ) [23:18:30] no one indicated that it is high priority [23:18:34] RoanKattouw: no, please don't [23:18:35] and it clearly wasn't [23:18:40] since it was sitting there for like two months [23:18:42] when it happens [23:18:51] first noticed 1st april - https://phabricator.wikimedia.org/P468 [23:19:05] the words "Fatal error:" indicate the priority [23:19:06] at least by me [23:19:08] RoanKattouw: might need to wait a bit. Ori et al are working on a situation [23:19:11] so please do not make a huge fuss of it suddenly. am i understanding correctly that we're stopping SWAT deployment because of two months old fatal? 
[23:19:21] ori: What is the sudden disaster that means you want us to have lied to the enwiki community and keep the A/B test going longer than we promised? [23:19:25] https://phabricator.wikimedia.org/T94900 [23:19:44] everybody calm down. the wikis will flow! [23:20:00] do the SWAT [23:20:15] this is a shit-show, though, and i'm disappointed as hell [23:20:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:20:19] * James_F too. [23:20:33] in MatmaRex especially [23:20:47] the phrase "two months old fatal" should not occur [23:21:14] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:21:29] * MatmaRex bows [23:22:42] ori: Maybe if there's a fatal in an extension we shouldn't blame the library it uses, hmm? [23:23:08] I don't blame code [23:23:43] Alright I'm gonna do the VE config change [23:23:48] RoanKattouw: If you want to take over, the MobileFrontend submodule update just merged: https://gerrit.wikimedia.org/r/#/c/216001/ . I haven't done James's VE config update yet, so that too. [23:23:53] Thanks [23:24:22] I am disappointed in engineers that know of a _fatal error_ happening at a rate of over a thousand times per day don't act on that knowledge because "no one indicated that it is high priority" [23:25:00] ori: Oldest currently-open fatal error in the newly-created list is https://phabricator.wikimedia.org/T24510 in LiquidThreads. From over five years ago. [23:25:15] ori: And no, *no one* gave numbers about severity. [23:25:40] $ zgrep -c ApiFormatFeedWrapper archive/fatal.log-201506* [23:25:40] archive/fatal.log-20150601.gz:0 [23:25:40] archive/fatal.log-20150602.gz:0 [23:25:42] archive/fatal.log-20150603.gz:0 [23:25:44] archive/fatal.log-20150604.gz:0 [23:25:54] ori: I don't have access to these logs. Neither does MatmaRex. 
[23:25:57] (03PS2) 10Catrope: Disable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211698 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [23:26:00] ori: So it's fixed? [23:26:08] (03CR) 10Catrope: [C: 032] Disable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211698 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [23:26:14] (03Merged) 10jenkins-bot: Disable A/B test of VisualEditor for new accounts on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211698 (https://phabricator.wikimedia.org/T90666) (owner: 10Jforrester) [23:26:42] James_F: i don't know if it's fixed or not; it's not occurring in [23:26:43] prod [23:26:56] PROBLEM - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 680.377617105 [23:26:59] ori: I can't wait for the day we de-deploy LQT. [23:27:17] We're working on the conversion :) [23:27:33] Although we're behind schedule quite a lot :S https://phabricator.wikimedia.org/T92303 [23:27:34] ori: personally i am disappointed about https://phabricator.wikimedia.org/T66721 still not being fixed, in spite of being filed almost a year ago and having three duplicates filed [23:27:37] RoanKattouw: Yeah. [23:27:50] I don't think you quite understand how severe this is. The OOUI fatals tend to cluster, so they have been tripping up 5xx error alerts [23:27:56] for weeks, presumably [23:28:00] pity we don't have server-side logs for that [23:28:12] ori: OK. What does that mean for users? [23:28:15] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 14.0 [23:28:42] anyone from ops want to jump in here? [23:28:42] It's not just server-side log noise but actually breaks things? 
[23:28:44] ori: it's actually pretty clear that no one understood that it was "severe". [23:28:53] if anyone understood, it would have been fixed [23:28:55] !log catrope Synchronized wmf-config/InitialiseSettings.php: Disable VE A/B test for new accounts on enwiki (duration: 00m 13s) [23:29:00] Logged the message, Master [23:29:02] no one in the know clarified the severity [23:29:03] is it an issue that 5xx alerts get tripped up multiple times a day by this bug for weeks? [23:29:13] ori: No-one you're talking to gets paged about those. [23:29:20] ori: There is no visibility about 5xx alerts. [23:29:28] if it was an issue for you, why did you not say so? [23:29:34] seriously [23:29:46] Clearly there should be more attention paid to them. [23:29:46] OK [23:29:48] i left my crystal ball in the other pants [23:29:50] I am indicating the severity [23:29:53] It is very severe [23:29:57] Can you fix it, now? [23:29:58] But https://phabricator.wikimedia.org/T94900 is marked as resolved? [23:30:06] So it sounds like it's fixed in master, just not deployed? [23:30:16] sorry to tell you this but platform stability is a shared responsibility now [23:30:21] it is fixed [23:30:29] deploy at will [23:30:50] who deployed this to prod in the first place? [23:30:58] the ops team is not staffed appropriately to track down every fatal introduced in mediawiki and there is certainly no mwcore anymore [23:31:08] 02:29 < James_F> ori: There is no visibility about 5xx alerts. [23:31:17] this statement is not true [23:31:27] What 5xx alerts? [23:31:28] it clearly is [23:31:40] if there was visibility, we would have known that this is an issue [23:31:42] paravoid: Where can I see them? Beyond Greg and Chad posting the link every few weeks to remind us. 
[23:31:42] it clearly is not and I think you should tone it down a notch [23:31:45] ori: My understanding is it's the Collection extension that's passing in weird data, and because they don't appear to want to fix that OOUI has been changed to work around that instead [23:31:52] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [23:31:54] these 5xx alerts [23:31:55] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:32:11] RoanKattouw: ori: no, it was a OOUI bug [23:32:17] James_F: every deployer has access to logs, logstash & graphite graphs, as well as the icinga alerts etc. [23:32:25] or rather a PHP bug/misfeature that we didn't work around in OOUI [23:32:30] James_F: BTW the A/B test is disabled now [23:32:33] you mean that's http 500s coming from the mw servers, not 500s being served to people browsing graphite.wm.o? [23:32:34] 7Puppet: Allow per-host hiera customizations on wikitech - https://phabricator.wikimedia.org/T97055#1339601 (10scfc) [23:32:37] As of 23:28:55 UTC [23:32:37] paravoid: neither me or James_F are deployers [23:32:43] RoanKattouw: Yeah, thanks. :-) [23:32:44] Krenair: correct [23:33:18] MatmaRex: did I point any fingers at you or James_F though? [23:33:24] Yeah the icinga warnings for 5xx are a bit confusing if you don't already know how it's set up [23:33:24] paravoid: no, ori did [23:33:26] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:33:45] OK, so people can see the rate of 5xx errors [23:33:47] So icinga bitches about dozens of things every day, but the 5xx things are urgent? [23:33:49] Ok. [23:33:51] Where can they see what these errors are? [23:33:56] anyway. it is late here [23:33:57] ok, who did you expect to find, triage and fix fatals? 
[23:34:00] (if they're not deployers) [23:34:02] paravoid: ori: do you need action from me? [23:34:14] paravoid: People empowered to know they exist? [23:34:22] MatmaRex: yes [23:34:24] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [23:34:29] if not, i'll go to sleep [23:34:32] I need a patch that can be safely applied on top of the current production branch [23:34:34] paravoid: Which releng do a great job of with the new on-Phabricator list. [23:34:35] that makes this fatal go away [23:34:45] I don't think this is releng's job either [23:34:47] ori: mw-config: De-deploy Collection extension. Fixed. [23:34:49] ori: it would be a mediawiki/vendor patch [23:35:06] both ops and releng can't scale to triage every error introduced by everyone around here [23:35:07] ori: and i don't know enough about composer to prepare it [23:35:13] How does one cherry-pick a library fix? [23:35:15] you ship it, you own it, sorry :) [23:35:25] It's https://gerrit.wikimedia.org/r/#/c/215052/ in oojs/ui [23:35:26] I can try to do a cherry-pick for /vendor, I guess. [23:35:37] And we can just V+2 it through… [23:35:38] s/both/neither/ anyway [23:35:46] ori: cherry-pick this and this: https://gerrit.wikimedia.org/r/#/c/215052/ https://gerrit.wikimedia.org/r/#/c/215713/ [23:36:04] or rather, apply the patches to the right branches in mediawiki/vendor [23:36:31] I can do that as part of the SWAT, if someone who understands vendor better than me can explain the right way to do it [23:37:05] PROBLEM - Kafka Broker Messages In on analytics1018 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [23:37:31] https://gerrit.wikimedia.org/r/216005 [23:37:51] legoktm: Does Jenkins let you do that? [23:37:51] oh, two patches? [23:37:55] RoanKattouw: it should [23:37:56] Yeah, two patches :S [23:37:58] legoktm: Two patches. 
[23:38:00] but I only grabbed one patch [23:38:09] legoktm: https://gerrit.wikimedia.org/r/#/c/215713/ [23:38:23] It won't complain about the code in the repo not matching the code that composer would fetch for that version? [23:38:25] legoktm: That will V-1 everything, won't it? [23:38:52] And if not, shouldn't it [23:38:52] ? [23:39:10] no it won't [23:39:19] Hmm. [23:39:22] because that would prevent us from applying live hacks as necessary [23:39:26] like in this case [23:39:33] Good :) [23:39:34] updated https://gerrit.wikimedia.org/r/216005 [23:39:55] that needs +2 and then cherry-picks to branches and submodule bumps in mediawiki/core like an extension [23:39:59] legoktm: Thanks… [23:40:08] * RoanKattouw +2s [23:40:09] RoanKattouw: can you take care of that ^? [23:40:10] thanks [23:40:40] Yeah I'm on it [23:40:49] !log catrope Synchronized php-1.26wmf8/extensions/MobileFrontend: SWAT (duration: 00m 13s) [23:40:52] legoktm: Branch singular, right now. [23:40:53] Logged the message, Master [23:40:55] RECOVERY - Kafka Broker Messages In on analytics1018 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3890.98053421 [23:41:03] right [23:41:10] legoktm: (wmf7 rolled off prod on Wednesday; wmf9 doesn't arrive 'til Tuesday.) 
[23:41:54] RECOVERY - Kafka Broker Messages In on analytics1012 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6011.70065171 [23:42:33] I have to leave in ~5min [23:42:38] I got it [23:42:48] awesome :) [23:42:51] legoktm: thank you sir [23:44:45] If there are any other nasty things like that in prod, they should probably be filed at https://phabricator.wikimedia.org/maniphest/?statuses=open%28%29&projects=PHID-PROJ-4uc7r7pdosfsk55qg7f6#R and/or marked as high / unbreak now [23:45:02] (Better link: https://phabricator.wikimedia.org/project/board/1055/) [23:53:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0] [23:55:44] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1339741 (10Dzahn) >>! In T93760#1339479, @Qgil wrote: > Is or could be {T823} a blocker for this task? If it fixes "email an attachment into the task & ensure its attachment isn't viewable to anyone...