[00:04:08] (PS2) Dzahn: Revert "ganglia: set ocg1001 as aggregator for ocg hosts" [puppet] - https://gerrit.wikimedia.org/r/216198
[00:04:42] (CR) Dzahn: [C: 2] Revert "ganglia: set ocg1001 as aggregator for ocg hosts" [puppet] - https://gerrit.wikimedia.org/r/216198 (owner: Dzahn)
[00:09:29] (PS2) Dzahn: Revert "ganglia: switch ocg servers to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/216201
[00:11:22] (CR) Dzahn: [C: 2] Revert "ganglia: switch ocg servers to ganglia_new" [puppet] - https://gerrit.wikimedia.org/r/216201 (owner: Dzahn)
[00:20:32] (PS1) Dzahn: Revert "Revert "ganglia: set ocg1001 as aggregator for ocg hosts"" [puppet] - https://gerrit.wikimedia.org/r/216365
[00:20:52] (CR) Dzahn: "doesn't work - There was an error collecting ganglia data (127.0.0.1:8654): fsockopen error: Connection refused" [puppet] - https://gerrit.wikimedia.org/r/216365 (owner: Dzahn)
[00:21:05] (PS2) Dzahn: Revert "Revert "ganglia: set ocg1001 as aggregator for ocg hosts"" [puppet] - https://gerrit.wikimedia.org/r/216365
[00:22:11] (CR) Dzahn: [C: 2] "sigh" [puppet] - https://gerrit.wikimedia.org/r/216365 (owner: Dzahn)
[00:24:50] cscott: the PDF servers are back in ganglia .. just that they are using the old class
[00:25:10] can't get them to use the new one without either breaking ganglia-web or the servers disappearing
[00:25:54] Ops-Access-Requests, operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1342896 (JKrauska) No sir. Didn't sync with Rob. Will try again Monday.
[00:26:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[00:30:22] (PS3) BryanDavis: elasticsearch: allow control of dynamic scripting [puppet] - https://gerrit.wikimedia.org/r/216325
[00:30:57] (PS2) BryanDavis: [WIP] logstash: jessie support and beta cluster cluster [puppet] - https://gerrit.wikimedia.org/r/216337 (https://phabricator.wikimedia.org/T101541)
[00:31:54] (PS3) BryanDavis: logstash: jessie support and beta cluster cluster [puppet] - https://gerrit.wikimedia.org/r/216337 (https://phabricator.wikimedia.org/T101541)
[00:40:03] (CR) BryanDavis: "Applied via cherry-pick on beta cluster." [puppet] - https://gerrit.wikimedia.org/r/216325 (owner: BryanDavis)
[00:46:50] (CR) BryanDavis: "Applied via cherry-pick on beta cluster. deployment-logstash2 jessie host up and running." [puppet] - https://gerrit.wikimedia.org/r/216337 (https://phabricator.wikimedia.org/T101541) (owner: BryanDavis)
[00:47:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:03:25] operations, Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1342928 (Dzahn) a:Dzahn
[01:13:55] operations: Migrate access-requests@ from RT to Phabricator - https://phabricator.wikimedia.org/T84861#1342935 (Dzahn) done?
[01:19:03] operations: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1342937 (Dzahn) a:Dzahn
[01:19:23] (PS1) Dzahn: install fonts-wqy-zenhei on app servers [puppet] - https://gerrit.wikimedia.org/r/216392 (https://phabricator.wikimedia.org/T84777)
[01:20:27] (PS2) Dzahn: install fonts-wqy-zenhei on app servers [puppet] - https://gerrit.wikimedia.org/r/216392 (https://phabricator.wikimedia.org/T84777)
[01:31:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[01:53:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:26:33] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 10s)
[02:26:42] Logged the message, Master
[02:31:27] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-06 02:30:24+00:00
[02:31:32] Logged the message, Master
[02:38:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[02:45:20] PROBLEM - puppet last run on mw2099 is CRITICAL Puppet has 1 failures
[02:59:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:02:30] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:13:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[03:38:20] PROBLEM - puppet last run on dbstore2002 is CRITICAL puppet fail
[03:39:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:55:39] RECOVERY - puppet last run on dbstore2002 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:05:19] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 58 failures
[04:24:33] operations: document redis upgrade/restart procedures - https://phabricator.wikimedia.org/T101585#1343058 (chasemp) p:Triage>Normal
[04:25:31] operations, Datasets-General-or-Unknown: dataset1001 - dpkg reports broken packages - https://phabricator.wikimedia.org/T101579#1343060 (chasemp) p:Triage>Normal
[04:25:50] operations, Analytics-Cluster: stat1002 - dpkg reports broken packages - https://phabricator.wikimedia.org/T101582#1343062 (chasemp) p:Triage>Normal
[04:26:33] operations: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#1343064 (chasemp) p:Triage>Normal
[04:43:20] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[05:36:19] RECOVERY - DPKG on dataset1001 is OK: All packages OK
[05:41:04] operations, Datasets-General-or-Unknown: dataset1001 - dpkg reports broken packages - https://phabricator.wikimedia.org/T101579#1343082 (ArielGlenn) Open>Resolved a:ArielGlenn I cleaned up a few of the entries and it's reporting ok now in icinga.
[05:42:40] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds
[05:48:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 6 05:47:40 UTC 2015 (duration 47m 39s)
[05:48:49] Logged the message, Master
[05:49:30] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[05:52:49] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (102051s 100000s)
[06:06:10] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[06:14:30] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[06:29:25] (Abandoned) Legoktm: Add logmsgbot instance for #wikimedia-releng that listens to gallium [puppet] - https://gerrit.wikimedia.org/r/197386 (owner: Legoktm)
[06:29:59] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures
[06:30:11] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[06:30:29] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on elastic1022 is CRITICAL Puppet has 1 failures
[06:32:50] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[06:33:20] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:46:00] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:47:10] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:47:29] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[07:17:00] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures
[07:34:00] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:07:40] PROBLEM - RAID on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:00] PROBLEM - SSH on lvs3001 is CRITICAL - Socket timeout after 10 seconds
[08:09:09] PROBLEM - DPKG on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:20] PROBLEM - Disk space on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:20] PROBLEM - dhclient process on lvs3001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:09:21] PROBLEM - configured eth on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:21] PROBLEM - Check rp_filter disabled on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:21] PROBLEM - puppet last run on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:29] PROBLEM - salt-minion processes on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:39] RECOVERY - DPKG on lvs3001 is OK: All packages OK
[08:34:50] RECOVERY - Disk space on lvs3001 is OK: DISK OK
[08:34:50] RECOVERY - dhclient process on lvs3001 is OK: PROCS OK: 0 processes with command name dhclient
[08:34:50] RECOVERY - Check rp_filter disabled on lvs3001 is OK kernel parameters are set to expected value.
[08:34:50] RECOVERY - configured eth on lvs3001 is OK - interfaces up
[08:34:50] RECOVERY - puppet last run on lvs3001 is OK Puppet is currently enabled, last run 49 minutes ago with 0 failures
[08:34:51] RECOVERY - RAID on lvs3001 is OK optimal, 2 logical, 2 physical
[08:35:00] RECOVERY - salt-minion processes on lvs3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:35:11] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:45:43] (PS1) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[08:46:47] (CR) jenkins-bot: [V: -1] Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403 (owner: Smalyshev)
[08:52:26] (PS2) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[08:53:09] (CR) jenkins-bot: [V: -1] Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403 (owner: Smalyshev)
[08:54:21] operations, vm-requests, discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1343109 (Joe) @faidon yes, we spoke about this in the ops meeting. Since the servers we chose (the zk servers) have to be moved from the analytics VLAN, and I need to port both the deb p...
[08:54:49] (PS1) ArielGlenn: dumps: clean up error handling in xml streamed dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/216404
[08:55:07] (PS3) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[08:56:40] (CR) ArielGlenn: [C: 2] dumps: clean up error handling in xml streamed dumps [dumps] (ariel) - https://gerrit.wikimedia.org/r/216404 (owner: ArielGlenn)
[09:00:33] (PS4) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[09:02:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0]
[09:05:11] (PS5) Ori.livneh: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403 (owner: Smalyshev)
[09:05:33] SMalyshev: I made some lint-type fixes; I hope that's OK. I'm a stickler for our Puppet style and deviations make me twitch.
[09:14:37] ori: thank you!
[09:14:38] (PS6) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[09:16:08] SMalyshev: no problem. We have an nginx module, by the way, so to provision an nginx site, all you have to do is: ::nginx::site { 'wdqs': content => template('wdqs/nginx.erb'), }
[09:16:23] this will take care of the package installation, managing the service, etc.
[09:16:46] if you grep for 'nginx::site' you'll find several examples
[09:16:49] ori: hmm... ok, I'll check it out. I wasn't sure it does what I want
[09:18:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[09:18:24] ori: thanks for noting it, I wasn't sure which parts already exist and which don't
[09:24:36] ori: if you're not sleeping yet - is there a list of puppet modules enabled somewhere?
[09:25:26] there's https://doc.wikimedia.org/puppet/ , but it's pretty unreadable, in my opinion. The contents of the modules/ directory is probably your best bet.
[09:26:11] aha, ok, thanks
[09:26:24] we were experimenting with submodules for a while but the consensus nowadays is that they are such a pain to work with that some code duplication is the lesser evil
[09:26:31] so nginx is actually quite unusual in being a submodule
[09:26:40] the rest is almost entirely directly in the repository
[09:27:03] i'm off for the night, good luck with puppet :)
[09:28:05] thanks!
[09:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 630
[09:40:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 4395275 Threads: 2 Questions: 15253899 Slow queries: 29278 Opens: 47095 Flush tables: 2 Open tables: 64 Queries per second avg: 3.470 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:49:10] (PS7) Smalyshev: Add definitions for WDQS [puppet] - https://gerrit.wikimedia.org/r/216403
[10:37:16] operations, vm-requests, discovery-system: eqiad: 3 VM request for ETCD - https://phabricator.wikimedia.org/T101506#1343161 (faidon) (the idea wasn't to stay with trusty but to upgrade those to jessie too) For the record, I think we'll call it "almost done", move on to another priority and never get t...
[11:40:30] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[11:57:45] Coren: could you restart https://tools.wmflabs.org/copyvios/ please
[12:10:10] PROBLEM - puppet last run on mw2093 is CRITICAL puppet fail
[12:17:30] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:27:10] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[12:32:40] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[12:34:10] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 43.96 ms
[12:44:00] (PS4) Giuseppe Lavagetto: etcd: setup servers/ganglia stuff [puppet] - https://gerrit.wikimedia.org/r/216099
[13:16:54] (PS1) Glaisher: Add 'abusefilter-modify-restricted' to abusefilter at plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/216415 (https://phabricator.wikimedia.org/T101604)
[13:36:29] (PS1) ArielGlenn: dumps: for small wikis that are dumped in one run, fix item range [dumps] (ariel) - https://gerrit.wikimedia.org/r/216416
[13:38:21] (CR) ArielGlenn: [C: 2] dumps: for small wikis that are dumped in one run, fix item range [dumps] (ariel) - https://gerrit.wikimedia.org/r/216416 (owner: ArielGlenn)
[13:54:16] (PS1) Andrew Bogott: Use ipresolve in a few more places. [puppet] - https://gerrit.wikimedia.org/r/216418
[13:54:59] (CR) jenkins-bot: [V: -1] Use ipresolve in a few more places. [puppet] - https://gerrit.wikimedia.org/r/216418 (owner: Andrew Bogott)
[14:00:03] (PS2) Andrew Bogott: Use ipresolve in a few more places. [puppet] - https://gerrit.wikimedia.org/r/216418
[14:04:05] (CR) Andrew Bogott: [C: 2] Use ipresolve in a few more places. [puppet] - https://gerrit.wikimedia.org/r/216418 (owner: Andrew Bogott)
[14:46:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[14:58:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[15:02:47] (PS1) Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - https://gerrit.wikimedia.org/r/216421
[15:03:02] (PS2) Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - https://gerrit.wikimedia.org/r/216421
[15:04:53] (CR) Paladox: "Adds less and json support for highlighting since they are now mostly used too. less is a form of css but more advanced." [puppet] - https://gerrit.wikimedia.org/r/216421 (owner: Paladox)
[15:06:38] (PS3) Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - https://gerrit.wikimedia.org/r/216421
[15:46:52] YuviPanda, hi, any updates on postgis upgrade? MaxSem and I are really blocked by it :(
[15:47:19] * yurik is sitting at UN at OSM conf :)
[15:48:30] nice
[15:49:02] yurik: I've no updates, I think it's basically just 'whenever Alex has time'
[15:54:29] YuviPanda, i think someone assigned that task to yu
[15:54:42] so akosiaris thinks its now on your plate
[15:54:47] (i suspect)
[15:54:51] oh, tha'ts probably for the check with people
[15:54:54] give me link again?
[15:54:56] I did they were ok with it
[15:55:18] YuviPanda, https://phabricator.wikimedia.org/T101233
[15:55:53] Blocked-on-Operations, Labs, Maps: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1343408 (yuvipanda) a:yuvipanda>None I asked them and they're ok with it :) And nope, not the right assignee :) This has no assignee atm.
[15:55:54] updated
[15:56:20] YuviPanda, do you think you can do it though? or is it akosiaris-only task?
[15:56:29] big blocker on our side
[15:56:35] my plate is pretty full now, sorry :(
[15:56:47] that's why i think we need root on that machine
[15:57:27] I think there are underlying fundamental organizational issues here, and just asking for root to trash around labsdb isn't going to help anything, yurik
[15:57:45] i am not planning on trashing it you know ))
[15:57:53] nobody plans on trashing things :)
[15:58:14] YuviPanda, it seems that according to http://postgis.net/install/ - you simply need to install the new binaries (they can coexist it seems)
[15:58:23] and than we can do the sql commands
[15:58:36] <_joe_> yurik: speak with mark if you're blocked, or raise this in the SoS I guess
[15:59:06] <_joe_> as far as I know, alex is pretty heavily working on the ops goal for this quarter
[15:59:25] <_joe_> yurik: I'll raise the question in the next ops meeting anyway
[15:59:50] _joe_, i raised it on the last sos. Thanks!
[16:00:01] <_joe_> not about root, about your need for a postgis upgrade
[16:00:02] Blocked-on-Operations, Labs, Maps, Scrum-of-Scrums: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1343410 (Yurik)
[16:00:14] _joe_, that's what i meant
[16:00:27] <_joe_> root is not gonna happen, as I will personally oppose to that :)
[16:00:48] for some reason the issue itself was not tagged with SOS
[16:01:08] <_joe_> yurik: I'll raise this on monday anyway :)
[16:01:22] <_joe_> we definitely don't want to block you if we can
[16:01:58] <_joe_> (it's kind of a standard pain for us to be blockers of everyone, of course)
[16:08:20] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[16:08:23] Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1343414 (Yurik) Resolved>Open I'm reopening this task - it was implemented as shared server as oppo...
[16:09:58] I don't understand- yesterday this wasn't working: https://phabricator.wikimedia.org/T101567
[16:18:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[17:03:39] (PS1) Yuvipanda: ores: Don't include proxy cache temp path when cache is turned off [puppet] - https://gerrit.wikimedia.org/r/216428
[17:03:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[17:04:11] (CR) Yuvipanda: [C: 2 V: 2] ores: Don't include proxy cache temp path when cache is turned off [puppet] - https://gerrit.wikimedia.org/r/216428 (owner: Yuvipanda)
[17:05:11] operations: foreachwikiexceptdblist to run scripts on all but a blacklist of wikis - https://phabricator.wikimedia.org/T101213#1343494 (Krenair) I've been thinking about Ori's [[ https://gerrit.wikimedia.org/r/#/c/208263/4 | dblist expressions ]] here - what if we gave `foreachwiki` a parameter that would lim...
[17:14:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[17:39:59] (PS1) Jcrespo: Upgrade es1007 from MariaDB 5.5 to 10 [puppet] - https://gerrit.wikimedia.org/r/216429
[17:49:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 25.00% of data above the critical threshold [500.0]
[17:57:31] (PS1) Jcrespo: Repool es1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/216430
[17:58:47] (CR) Jcrespo: [C: -1] "Do not commit until gerrit:216429 is merged and we are 100% sure the server is up and stable." [mediawiki-config] - https://gerrit.wikimedia.org/r/216430 (owner: Jcrespo)
[17:59:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[18:02:09] PROBLEM - RAID on ms-be2008 is CRITICAL 1 failed LD(s) (Offline)
[18:18:40] PROBLEM - puppet last run on ms-be2008 is CRITICAL Puppet has 1 failures
[18:53:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[19:08:51] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:22:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[19:34:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:49:41] (PS1) BBlack: turn text/mobile-be retry503 behavior back on T97206 [puppet] - https://gerrit.wikimedia.org/r/216438
[19:50:09] (CR) BBlack: [C: 2 V: 2] turn text/mobile-be retry503 behavior back on T97206 [puppet] - https://gerrit.wikimedia.org/r/216438 (owner: BBlack)
[19:53:37] operations, Architecture, MediaWiki-RfCs, RESTBase, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1343619 (BBlack) The graphs since turning off retry503 on the text/mobile backends have had a lot of odd spikes. I suspect th...
[19:57:29] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 4 failures
[20:15:31] (CR) Alex Monk: "Jeremy, please abandon this. (and upload an equivalent to puppet repository?)" [apache-config] - https://gerrit.wikimedia.org/r/24407 (owner: Jeremyb)
[20:16:00] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:40:10] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 24 failures
[20:57:00] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[20:57:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 28.57% of data above the critical threshold [500.0]
[21:11:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[21:37:31] (PS4) Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - https://gerrit.wikimedia.org/r/216421
[21:59:51] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[22:06:49] PROBLEM - puppet last run on mw1240 is CRITICAL Puppet has 1 failures
[22:22:09] RECOVERY - puppet last run on mw1240 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[22:55:50] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[23:31:19] deploying a parsoid hotfix for T101599
[23:46:00] !log deployed parsoid 5172a446 (cherry-pick of 719c736f) -- hotfix for T101599
[23:46:04] Logged the message, Master
[23:48:49] Thanks subbu for this quick fix.
[23:49:55] Dereckson, happy to. looks like this was biting frwiktionary folks a bit badly.
[23:50:33] Yes, they use templates in titles on every (main) page.