[00:00:02] !log mw1293 - restart hhvm [00:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:28] ohh yeh sorry got distracted :) [00:00:29] (03Merged) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [00:00:39] (03CR) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [00:00:40] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181989 (10Andrew) I think these 'failed to parse' issues are something about the host on which the puppet compiler is running. '/conftool/v1/v1/pools is not a directory' [00:01:12] 308 Undefined variable: wmgRelatedArticlesFooterBlacklistedSkins in /srv/mediawiki/wmf-config/CommonSettings.php on line 2878 [00:01:24] oh you've already have it [00:02:13] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 41s) [00:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:19] T162201: Cleanup artifacts of related pages desktop beta feature - https://phabricator.wikimedia.org/T162201 [00:03:11] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 41s) [00:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:22] jdlrobson: Synced everywhere. [00:03:44] !log mw1297 - restart hhvm/apache [00:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:19] Niharika: may I add to SWAT https://gerrit.wikimedia.org/r/348174 - Fix Abuse Filter configuration for tr.wikipedia? It's a follow-up for a change deployed earlier. [00:05:20] Niharika: wooop [00:05:21] thank you [00:05:47] !log niharika29@tin Started scap: Reword ORES preferences (T162831), Put ORES r behind a preference (T162831), Deploy Special:Autoblocklist (T146414) [00:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:56] T162831: Tweak ORES-Related Preferences for Watchlist and RC Page ahead of next release - https://phabricator.wikimedia.org/T162831 [00:05:56] T146414: Create Special:AutoblockList - https://phabricator.wikimedia.org/T146414 [00:06:40] ebernhardson: Hey, One quick question since I want to fix https://phabricator.wikimedia.org/T161563. Does logstash gzip compressed logs? [00:06:51] *Does accept [00:09:34] bd808: Who's the other designer "jgs" behind the scap piggy? [00:10:04] Niharika: https://en.wikipedia.org/wiki/Joan_Stark [00:10:05] Joan G. Stark [00:10:09] Dereckson: Sure. 
[00:10:24] (03CR) 10Niharika29: [C: 032] Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:10:34] Amir1: hmm, lemme check what it does [00:10:36] that's a good edit for Wikidata , occupation: ASCII artist, heh [00:11:03] Amir1: "A GELF message is a GZIP’d or ZLIB’d JSON string with the following fields: [00:11:31] Great [00:11:46] Niharika: scappy started life as the flying pig from https://web.archive.org/web/20091027211201/http://www.geocities.com/SoHo/7373/farm.htm#pig [00:12:20] bd808: You colored it and added the "MW"? [00:12:32] (03Merged) 10jenkins-bot: Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:12:40] Amir1: i suppose a semi-easy way to check, it looks like logstash expects the message to start with either: 0x78 0x9c (zlib), or 0x1f 0x8b (gzip) [00:12:46] (03CR) 10jenkins-bot: Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [00:12:49] those headers should come from the library itself afaik [00:13:02] Niharika: among other things yeah. there were a lot of little tweaks from the original. [00:13:44] :) Why did you pick the pig? [00:13:57] because pigs might fly [00:14:05] ebernhardson: hmm, I'm not sure if I have access to logstash inputs (server-side) [00:14:24] because the original bash scripts were a horrible mess and I dressed them up and made them fly [00:14:39] but still a mess at heart [00:14:46] Ah. :) [00:14:50] quiddity: what do you mean with the chan will get moderated? [00:14:57] Niharika: thanks [00:15:05] also pigs are second only to unicorns in terms of awesome animals :) [00:15:24] Amir1: probably not, but if you were testing locally you could probably check with wireshark or something to see what udp messages are sending, using something simple like `nc -l -u >/dev/null` or some such to receive the messages [00:15:40] * ebernhardson realizes that needs a port too...bad example :P [00:17:14] did the pig get approval from the BoC of the WCA ? 
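The GELF detail quoted above (a GZIP'd or ZLIB'd JSON string, recognisable by its first two bytes, 0x1f 0x8b for gzip or 0x78 0x9c for zlib) is easy to verify offline. The following is a minimal sketch of that magic-byte check, assuming only Python 3; the function names are illustrative and say nothing about the actual logstash plugin internals.

```python
# Sketch (not part of the original log): classify a raw UDP payload by its
# leading magic bytes, the way the conversation above describes the GELF
# input doing it. 0x1f 0x8b is the gzip magic number; zlib streams start
# with 0x78 (0x9c being the default-compression flag byte quoted above).
import gzip
import json
import zlib


def classify_payload(payload: bytes) -> str:
    """Return 'gzip', 'zlib' or 'unknown' based on the first bytes."""
    if payload[:2] == b"\x1f\x8b":
        return "gzip"
    if payload[:1] == b"\x78":
        return "zlib"
    return "unknown"


def decode_payload(payload: bytes) -> dict:
    kind = classify_payload(payload)
    if kind == "gzip":
        return json.loads(gzip.decompress(payload))
    if kind == "zlib":
        return json.loads(zlib.decompress(payload))
    raise ValueError("payload is neither gzip nor zlib compressed")


if __name__ == "__main__":
    sample = zlib.compress(json.dumps({"short_message": "hello"}).encode())
    print(classify_payload(sample), decode_payload(sample))
```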
[00:17:19] Sagan, I assume that means it will be made "/mode +m" ("-Only opped/voiced users may talk in channel.") [00:17:41] quiddity: ok :) [00:17:46] the bots shall inherit this channel [00:21:02] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6158/" [puppet] - 10https://gerrit.wikimedia.org/r/348172 (https://phabricator.wikimedia.org/T162183) (owner: 10Dzahn) [00:21:09] (03PS2) 10Dzahn: tendril: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348172 (https://phabricator.wikimedia.org/T162183) [00:24:39] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987460 keys, up 21 days 8 hours - replication_delay is 0 [00:26:48] RECOVERY - DPKG on naos is OK: All packages OK [00:26:49] (03PS1) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [00:26:58] RECOVERY - Check size of conntrack table on naos is OK: OK: nf_conntrack is 0 % full [00:26:58] RECOVERY - salt-minion processes on naos is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:26:59] RECOVERY - dhclient process on naos is OK: PROCS OK: 0 processes with command name dhclient [00:27:08] RECOVERY - Disk space on naos is OK: DISK OK [00:27:26] ebernhardson: Where can I find error reports of logstash in beta cluster? [00:27:38] RECOVERY - Check the NTP synchronisation status of timesyncd on naos is OK: OK: synced at Fri 2017-04-14 00:27:29 UTC. [00:27:38] RECOVERY - Check whether ferm is active by checking the default input chain on naos is OK: OK ferm input default policy is set [00:27:43] I want to cherry-pick https://gerrit.wikimedia.org/r/348184 and see if it fixes [00:27:48] RECOVERY - MD RAID on naos is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [00:28:15] RainbowSprinkles: We were going to wait, but Niharika wanted to try deploying a real feature as part of her deployment training [00:28:24] Amir1: deployment-logstash2 [00:28:34] kaldari: Eh, ok I guess. [00:29:11] bd808: okay, but in any particular directory? [00:29:15] RainbowSprinkles: And since I already had the backport sitting there it seemed convenient :P [00:29:23] /var/log/logstash [00:29:37] nice, in it [00:29:48] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182142 (10Dzahn) 05Resolved>03Open re-opening since Icinga has many alerts: https://icinga.wikimedia.org/cgi... [00:30:32] !log niharika29@tin Finished scap: Reword ORES preferences (T162831), Put ORES r behind a preference (T162831), Deploy Special:Autoblocklist (T146414) (duration: 24m 44s) [00:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:40] T162831: Tweak ORES-Related Preferences for Watchlist and RC Page ahead of next release - https://phabricator.wikimedia.org/T162831 [00:30:40] T146414: Create Special:AutoblockList - https://phabricator.wikimedia.org/T146414 [00:31:10] Amir1: the errors are on the logstash machines themselves unfortunately, we didn't want to make some loop where logstash logs to itself [00:31:45] Dereckson: Your patch is on mwdebug1002. [00:31:50] Anything to check? [00:32:34] Amir1: so like logtash1002.eqiad.wmnet:/var/log/logstash/logstash-plain.log is where they would end up [00:33:03] ebernhardson: even in beta cluster? 
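For the local-testing idea above, where the `nc -l -u` one-liner was missing a port, here is a sketch of an equivalent UDP listener that also shows whether senders are emitting compressed or plain-JSON payloads. It assumes Python 3; port 12201 is used only because it is the GELF port mentioned in the discussion, any free port works.

```python
# Sketch (not part of the original log): a tiny UDP listener standing in for
# the `nc -l -u` suggestion above, with the missing port made explicit. It
# prints the first bytes of every datagram so you can tell compressed
# (1f8b / 78..) payloads apart from raw JSON.
import socket

LISTEN_ADDR = ("0.0.0.0", 12201)  # assumption: testing the GELF/UDP port locally


def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    print(f"listening on udp://{LISTEN_ADDR[0]}:{LISTEN_ADDR[1]}")
    while True:
        data, peer = sock.recvfrom(65535)
        print(f"{peer[0]}:{peer[1]} sent {len(data)} bytes, "
              f"first bytes: {data[:4].hex()}")


if __name__ == "__main__":
    main()
```

Pointing a test sender at this listener (or capturing with tcpdump/wireshark as suggested above) makes it obvious on sight whether the payload starts with a compression header.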
[00:33:24] Amir1: deployment-logstash2.eqiad.wmflabs i believe [00:33:32] 06Operations, 10DBA, 10Icinga, 10Monitoring, 13Patch-For-Review: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182160 (10Dzahn) fixed. false positives are gone, the real check stays and is OK https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_stri... [00:33:32] yeah [00:33:38] 06Operations, 10DBA, 10Icinga, 10Monitoring, 13Patch-For-Review: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182163 (10Dzahn) 05Open>03Resolved [00:33:38] RECOVERY - nutcracker port on naos is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [00:33:38] RECOVERY - nutcracker process on naos is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [00:33:54] I'm in that folder now but there is no ores in it [00:34:01] Dereckson: Are you around? [00:34:01] I'm trying to produce some :D [00:34:08] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) [00:35:55] Amir1: the messages that look like this are (most likely) ores: [2017-04-14T00:31:04,915][WARN ][logstash.inputs.gelf ] Gelfd failed to parse a message skipping {:exception=>#, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/1.9/gems/gelfd-0.2.0/lib/gelfd/parser.rb:14:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-gelf- [00:35:55] ebernhardson: it seems the beta cluster is trying to write over production: https://phabricator.wikimedia.org/T161563 [00:36:10] "host":"deployment-sca03" [00:36:45] Amir1: i had to find those messages with tcpdump, logstash logging doesn't actually report what the invalid message ways (partially because it expects it to have been binary, but malformed i suppose) [00:37:05] okay [00:37:35] Niharika: ping [00:37:51] logs look good to me [00:37:54] Amir1: uhh.. are you jsut sending to the wrong port? does uwsgi really know how to do GELF? [00:37:59] Dereckson: Okay. Syncing. [00:38:14] bd808: it seems it's not compressed [00:38:34] bd808: it doesn't really do gelf, it's just sending json to udp 12201 [00:38:51] okay now it's time to cherry-pick it [00:38:54] bd808: so it won't chunk correctly, but the main problem right now is its also not sending it compressed, and logstash requires the udp messages to be compressed [00:39:25] !log niharika29@tin Synchronized wmf-config/abusefilter.php: Fix Abuse Filter configuration for tr.wikipedia (T161960) (duration: 00m 42s) [00:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:33] T161960: Enable the blocking feature of AbuseFilter on trwiki - https://phabricator.wikimedia.org/T161960 [00:39:43] Looks like https://dbtree.wikimedia.org/ is down and also I get 401 Unauthorized error when I try to go to Kibana. [00:39:49] All done! Woohoo. [00:40:06] Thanks for the deploy Niharika [00:40:14] kaldari: hmm, fwiw kibana loaded for me. But that doesn't mean a whole lot [00:40:14] ebernhardson: yeah, but it should really be sending to port 11514 [00:40:24] Thanks to RoanKattouw. 
:) [00:40:30] which is line oriented json [00:40:33] Niharika: \m/ [00:40:34] (03PS2) 10Dzahn: Use new pageassessments dblist to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/348173 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [00:40:37] bd808: oh, i didn't realize that [00:41:10] ebernhardson: Are you using https://logstash.wikimedia.org/app/kibana ? [00:41:21] kaldari: yes [00:41:25] Amir1: I'm pretty sure your port is the problem [00:41:25] weird [00:41:41] striker sends to 11514 [00:42:15] there are a gazillion different input ports on the logstash boxes for different input types [00:42:18] (03CR) 10Dzahn: [C: 032] Use new pageassessments dblist to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/348173 (https://phabricator.wikimedia.org/T159438) (owner: 10Kaldari) [00:42:24] it's strange because the port is coming from hiera [00:42:25] and 11514 is line oriented json [00:42:41] somebody put the wrong value in :) [00:42:58] hiera isn't magic, gut config [00:43:05] *just config [00:43:26] I mean the weird part is someone didn't notice it before [00:43:48] ebernhardson: even weirder. It loads fine in Safari, but not Firefox. Guess I'll go file a bug. [00:44:14] kaldari: works in FF for me [00:44:17] I swear it worked a couple weeks ago [00:44:32] I've got FF 52.0.2 [00:44:45] from the ESR channel [00:45:12] kaldari: hmm, probably worthwhile i suppose. That authorization denied would be coming from the apache instance, rather than kibana itself. http auth is very standardized so not sure what could be wrong [00:46:05] Amir1: it's using service::configuration::logstash_port_logback which is the GELF port that the nodejs app use [00:46:52] shoot, we should fix it [00:47:33] (03CR) 10Dzahn: "this is a bit tricky. I _do_ agree that using regex to match an exact commandline is normally better to monitor a running service and we d" [puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [00:52:48] Amir1: uhhh well... modules/service/manifests/configuration.pp says "$logstash_port_logback = 11514" so I don't know how you are getting to port 12201 [00:52:50] (03PS2) 10Dzahn: standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 [00:53:21] last kaldari [00:53:25] oops [00:53:39] i just wanted to say the pageassesment thing is merged [00:53:40] (03PS2) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [00:53:51] and look at auth issue [00:56:20] kaldari: kibana should really let you login, does it ask you for login? the wikitech credentials should do it since you are in "wmf" group [00:57:01] and dbtree loads for me.. maybe a little slow at the beginning [00:57:05] mutante: It isn't even asking for credentials, it just immediately gives me an Unauthorized. Lemme try rebooting FF. [00:57:42] mutante: When I try to go to https://dbtree.wikimedia.org/ it says "database connection to tendril on tendril-backend.eqiad.wmnetfailed" [00:58:11] weird, i dont see that error [00:58:17] and i should also be using eqiad [00:58:34] mutante: I'm just not having any luck today :) [00:59:08] yea.. hmm.. works for me.. but that's not an error that sounds like local [00:59:40] mutante: OK, rebooted FF and now Kibana is working fine for me [00:59:56] ok! 
that's something:) [01:00:03] mutante: https://dbtree.wikimedia.org/ is still broken though [01:00:03] dbtree is working for me kaldari [01:00:34] bd808: found the issue behind it. It was an old cherry-pick in beta cluster [01:00:46] kaldari: if possible goto cmd prompt and type ipconfig /flushdns and then ipconfig /renew [01:01:06] i had the issue with enwiki before and it fixed it [01:01:25] dbtree is busted for me too. same "database connection to tendril on tendril-backend.eqiad.wmnetfailed" message [01:01:45] x-cache header says "cp2006 miss, cp4001 hit/2, cp4003 hit/3" [01:01:52] so dbtree uses misc-varnish, the director is "noc" [01:01:59] "noc" has 2 backends, terbium and wasat [01:02:09] hah, I'm not crazy! [01:02:15] :) [01:02:35] the varnish head I'm hitting is [2620:0:863:ed1a::3:d]:443 [01:03:08] RECOVERY - Check systemd state on naos is OK: OK - running: The system is fully operational [01:04:09] my x-cache:cp1058 miss, cp1058 hit/4 [01:04:19] tendril-backend.eqiad = db1011 [01:04:24] there is no tendril-backend.codfw [01:04:37] so we are not talking to different backends.. uhmm [01:04:53] my x-varnish:36791322, 6094755 6124425 [01:04:57] we should make another ticket :p [01:05:46] db1011 appears to be running normal afaict [01:05:58] cp2* is codfw and then what is cp4*? sf? [01:06:05] yeah [01:06:06] yes, 4 = ulsfo [01:06:26] so it looks like the sf varnish is the bad one [01:06:34] cp4003 [01:06:48] i take it cp1* is eqiad? [01:06:54] yes [01:06:58] and 2 is esams [01:07:03] uh, 3 is [01:07:09] but why would it be a varnish problem if it is "database connection to tendril" [01:07:14] and cp5* will be somewhere in asia :) [01:07:24] heh, yea [01:07:29] we are waiting for the name :) [01:07:50] we have a german one? [01:08:05] Last I checked, Amsterdam is not in Germany [01:08:11] we know it will start with "sin" [01:08:22] Zppix: https://wikitech.wikimedia.org/wiki/Esams_cluster [01:08:22] end, surely? ;) [01:08:24] eh, end [01:08:40] im shocked not having a server in germany is like not having servers at all :D [01:08:58] Zppix why? [01:09:08] ebernhardson bd808: After cherry-picking and fixing the port by removing old cherry-pick, now errors have stopped [01:09:16] Amir1: w00t! [01:09:19] germany is usually where al lthe servers are hosted in the EU atleast that ive seen [01:09:31] I think you're very wrong on that [01:09:35] yeah [01:09:40] There's some big german based providers, sure, hetzner etc [01:09:42] "that i've seen" [01:10:30] hmm, the cabeling demoed on the esams page is a bit messy :P (but don't look behind my tv either...) [01:10:37] cabling [01:11:04] It's a few years ago :P [01:11:07] ebernhardson: blame m.ark ;) [01:11:12] Zppix actually they are all spread out [01:11:15] (03PS3) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [01:12:36] ebernhardson: codfw is prettier -- https://wikitech.wikimedia.org/wiki/Codfw_cluster#/media/File:Wikimedia_Foundation_Servers_2015-86.jpg [01:18:52] fwiw, comparing Amsterdam IX with Frankfurt CIX https://ams-ix.net/technical/statistics vs https://www.de-cix.net/en/locations/germany/frankfurt/statistics they are both about the same in terms of traffic, over 5Terabyte/s [01:18:58] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:21:32] ooh, codfw is prettier :) [01:22:00] definitely :) that is papaul's work [01:26:54] (03CR) 10Ladsgroup: "I cherry-picked this in beta cluster. The errors have stopped but I couldn't see any logs coming from requests in logsatsh. Maybe still so" [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: 10Ladsgroup) [01:27:31] ebernhardson: ORES logs are indeed saved in logstash [01:27:32] * Amir1 https://gerrit.wikimedia.org/r/#/c/348184/1 [01:27:48] Sorry, wrong link [01:27:55] https://logstash.wikimedia.org/app/kibana#/dashboard/ORES [01:28:29] It was only beta cluster [01:32:38] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182225 (10Dzahn) [01:33:38] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [01:33:48] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [01:34:38] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [01:34:48] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.048 second response time [01:34:53] Amir1: hmm, interesting. I'm sure i had found a bunch of error messages on the prod machines [01:35:13] (03PS4) 10Ladsgroup: service: use gzip for logging in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) [01:37:00] ebernhardson: if you find them again. Can you inform me? maybe it's another issue [01:37:15] Amir1: checking, i'm not seeing them in any of todays logs, so perhaps something changed since then [01:38:35] Amir1: well, if it's not error now and the logs are in logstash, thats probably good enough [01:39:08] great, I hope it helps in moving the upgrade forward [01:39:46] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (10Paladox) it works for me. [01:47:58] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:48:06] (03PS1) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:49:13] (03CR) 10jerkins-bot: [V: 04-1] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [01:49:25] (03PS2) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:50:20] (03CR) 10jerkins-bot: [V: 04-1] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [01:50:24] (03PS3) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:52:13] (03CR) 10Dzahn: [C: 031] "i removed the change in modules/role/manifests/memcached.pp:8 . I could not explain the syntax error. but it's gone without that. 
strange " [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [01:53:00] (03PS4) 10Dzahn: contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) [01:55:52] (03CR) 10Dzahn: [C: 032] contint/icinga: skip zmq_publisher monitor if no jenkins [puppet] - 10https://gerrit.wikimedia.org/r/348191 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [02:01:02] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182243 (10Dzahn) fixed. gone on 2001, exists on 1001, no more cruft in Icinga https://icinga.w... [02:01:15] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3182246 (10Dzahn) [02:01:17] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3182244 (10Dzahn) 05Open>03Resolved [02:02:06] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2795939 (10Dzahn) once jenkins is running on both servers, don't forget to remove https://gerrit.wikimedia.org/r/#/c/348171/... [02:12:23] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3182269 (10Dzahn) [02:12:25] 06Operations, 10DBA, 10Traffic, 13Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#3182270 (10Dzahn) [02:14:09] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#3182272 (10Dzahn) [02:14:12] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3154666 (10Dzahn) [02:22:36] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#3182302 (10Dzahn) 05Resolved>03Open re-opening. tendril in DNS is still an alias for einsteinium ``` tendril.wikimedia.org is an alias for einsteinium.wikimedia.org. einsteinium.... 
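To summarise the port mix-up resolved in the exchange earlier (uwsgi emitting plain JSON to the GELF UDP port 12201, which expects compressed payloads, instead of newline-terminated JSON to the line-oriented input on 11514), here is a hedged sketch of the two send paths. The host, ports and event fields are placeholders drawn from the conversation, not a description of the production configuration, and the GELF helper deliberately skips chunking and the full set of required GELF fields, as noted above.

```python
# Sketch (not part of the original log): the two logstash send paths discussed
# above. send_json_line() is the line-oriented JSON input (11514 here);
# send_gelf() shows the compressed payload the GELF input (12201 here) expects
# instead of raw JSON. No GELF chunking or required-field handling is done.
import json
import socket
import zlib

LOGSTASH_HOST = "127.0.0.1"  # placeholder; point at your logstash host


def _send_udp(payload: bytes, host: str, port: int) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def send_json_line(event: dict, host: str = LOGSTASH_HOST, port: int = 11514) -> None:
    """Line-oriented JSON input: one uncompressed JSON object per line."""
    _send_udp((json.dumps(event) + "\n").encode("utf-8"), host, port)


def send_gelf(event: dict, host: str = LOGSTASH_HOST, port: int = 12201) -> None:
    """GELF/UDP input: the payload must be a compressed JSON document."""
    _send_udp(zlib.compress(json.dumps(event).encode("utf-8")), host, port)


if __name__ == "__main__":
    evt = {"host": "deployment-sca03", "type": "ores", "message": "request served"}
    send_json_line(evt)  # what the uwsgi logger effectively needed
    send_gelf(evt)       # what the GELF port expects instead of raw JSON
```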
[02:23:39] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3182307 (10Dzahn) [02:26:53] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2990878 keys, up 21 days 10 hours - replication_delay is 635 [02:45:53] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2986371 keys, up 21 days 10 hours - replication_delay is 0 [03:00:11] (03PS1) 10Dzahn: ci/labs/tendril: add some comments/FIXMEs about moving Hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/348194 [03:01:38] (03CR) 10Dzahn: [C: 032] "only comments" [puppet] - 10https://gerrit.wikimedia.org/r/348194 (owner: 10Dzahn) [03:19:44] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3182333 (10faidon) Sure, that's OK. [03:32:53] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:32:53] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:53] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:35:53] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [03:40:56] (03PS1) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:43:29] (03PS2) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:44:45] (03PS3) 10Dzahn: base::kernel: add mod blacklist specific to R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:47:58] (03Abandoned) 10Dzahn: base: blacklist acpi_pad kernel module [puppet] - 10https://gerrit.wikimedia.org/r/348016 (owner: 10Dzahn) [03:50:44] (03PS4) 10Dzahn: base::kernel: mod blacklist for Dell R320, blacklist acpi_pad [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) [03:56:23] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6161/ (praseo* and xeon are R320, the other 2 are random others as control)" [puppet] - 10https://gerrit.wikimedia.org/r/348197 (https://phabricator.wikimedia.org/T162850) (owner: 10Dzahn) [04:09:23] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=461.40 Read Requests/Sec=579.20 Write Requests/Sec=26.70 KBytes Read/Sec=38439.20 KBytes_Written/Sec=158.40 [04:15:24] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.40 Read Requests/Sec=0.80 Write Requests/Sec=2.90 KBytes Read/Sec=3.20 KBytes_Written/Sec=70.00 [04:49:17] 06Operations, 10DBA: dbtree broken (for some users?) 
- https://phabricator.wikimedia.org/T162976#3182380 (10Peachey88) [05:57:41] (03PS1) 10Matthias Mullie: Full path to xvfb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348199 [06:31:29] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3182388 (10elukey) From http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.Redis.Versions.html: ``` Redis Versions Prior to 2.8.22 Redis bac... [06:37:53] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [07:05:54] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:10:21] (03PS1) 10Legoktm: Create view for "linter" table on Labs [puppet] - 10https://gerrit.wikimedia.org/r/348201 (https://phabricator.wikimedia.org/T160611) [07:23:53] !log executed CONFIG SET appendfsync no on redis2005:6780 as performance test [07:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:30] I started this test yesterday but got stopped by the unrelated replica lagging [07:25:21] Redis latency doctor is suggesting that all the redis jobqueues are showing spikes in latency for AOF related activities, that with the current (Default) config involve fsync() every 1s [07:26:01] what I am trying to test is if avoiding fsync could remove latency spikes registered [07:26:15] not suggesting to remove it everywhere of course, this is only to prove a point :) [08:00:13] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182442 (10elukey) @hashar nice finding! So let me recap the errors that we are seein... [08:02:59] 06Operations, 10Traffic, 10netops: Network equipment order for SIN - https://phabricator.wikimedia.org/T162984#3182444 (10ayounsi) [08:03:53] 06Operations, 10Traffic, 10netops, 10procurement: Network equipment order for SIN - https://phabricator.wikimedia.org/T162984#3182458 (10ayounsi) [08:09:39] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182459 (10elukey) While reviewing the above data and graphite metrics I realized that... [08:24:43] PROBLEM - swift-container-replicator on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:53] PROBLEM - swift-container-auditor on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
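The appendfsync experiment logged above (set to `no` on rdb2005 as a test, restored to `everysec` afterwards) can be repeated and rolled back safely with a few lines of client code. This is a minimal sketch assuming redis-py and a reachable test instance; it measures PING latency only as a rough proxy for the fsync-related spikes reported by the latency doctor, and, as the log itself stresses, it is not a suggestion to disable fsync everywhere.

```python
# Sketch (not part of the original log): the CONFIG SET appendfsync experiment
# above, expressed with redis-py so it is easy to repeat and always roll back.
# Host and port are placeholders for a test instance.
import time

import redis  # assumption: redis-py is installed

r = redis.StrictRedis(host="127.0.0.1", port=6380, decode_responses=True)


def p99_ping_latency_ms(samples: int = 200) -> float:
    """Rough p99 PING latency in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        r.ping()
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[int(len(times) * 0.99) - 1]


if __name__ == "__main__":
    original = r.config_get("appendfsync")["appendfsync"]
    print("baseline p99 (ms):", p99_ping_latency_ms(), "appendfsync =", original)
    try:
        r.config_set("appendfsync", "no")      # the temporary experiment
        print("no-fsync p99 (ms):", p99_ping_latency_ms())
    finally:
        r.config_set("appendfsync", original)  # always restore, as !log'd above
```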
[08:25:33] RECOVERY - swift-container-replicator on ms-be2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:25:43] RECOVERY - swift-container-auditor on ms-be2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:32:03] !log restored appendfsync to 'everysec' on Redis rdb2005:6380 (end of performance experiment) [08:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:46] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182473 (10elukey) We removed the persistent connections in T129517#2113526 @aaron -... [09:18:53] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [09:19:53] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2978037 keys, up 21 days 17 hours - replication_delay is 0 [09:23:18] I know I know Redis you are not happy with the buffers [09:25:13] PROBLEM - swift-object-updater on ms-be2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:03] RECOVERY - swift-object-updater on ms-be2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:43:34] !log temporarily set sysctl -w net.ipv4.ip_local_port_range="15000 64000" on mw1306 (jobrunner) as test - (rollback: sysctl -w net.ipv4.ip_local_port_range="32768 60999") - T157968 [09:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:42] T157968: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968 [09:44:53] PROBLEM - Check size of conntrack table on mw1306 is CRITICAL: CRITICAL: nf_conntrack is 95 % full [09:45:04] elukey: contract is full :( [09:45:29] yeah makes sense [09:45:52] more different (src, src_port, dest, dest_port) to keep track of I guess [09:46:08] then since tcp reuse is around, I guess we can try lowering the local port range [09:47:11] so we have net.netfilter.nf_conntrack_max = 262144 maximum [09:47:12] mmm [09:47:53] RECOVERY - Check size of conntrack table on mw1306 is OK: OK: nf_conntrack is 64 % full [09:48:19] elukey@mw1306:~$ sudo sysctl net.netfilter.nf_conntrack_max [09:48:19] net.netfilter.nf_conntrack_max = 262144 [09:48:19] elukey@mw1306:~$ sudo sysctl net.netfilter.nf_conntrack_count [09:48:20] net.netfilter.nf_conntrack_count = 173125 [09:48:35] so yeah that value would need to be bumped a bit as well [09:50:26] could we set the local port range per process? [09:50:49] !log temporarily set sysctl -w net.netfilter.nf_conntrack_max=524288 on mw1306 (jobrunner) as test - (rollback: sysctl -w net.netfilter.nf_conntrack_max=262144") [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:47] forget me [09:51:51] I am mumbling really [09:52:22] the thing thing is that if the connections are reused, I am not sure why we would lack sockets [09:52:56] well not all of them are reused [09:53:03] only the ones that can be reused safely [09:56:46] so total conns in TIME_WAIT are now 102287 [09:59:54] hashar: what should I use as filter string in kibana to isolate redis messages and host:mw1306 [09:59:57] ? 
[10:00:10] yup [10:00:11] https://logstash.wikimedia.org/goto/0189564b88c008041dfd7eddb7c9e1a7 [10:00:17] and did channels.raw: "redis" [10:00:25] was actually looking at that [10:00:42] thanks :) [10:01:13] the thing is we have multiple errors [10:03:50] TCP: 110017 (estab 88, closed 109889, orphaned 0, synrecv 0, timewait 109886/0), ports 0 [10:03:57] this is a bit crazy to look on a client [10:05:07] then give the client reuses tcp connections [10:05:22] I am not sure how it would end up falling to connect due to lack of local fd [10:05:33] maybe there is a similar issue on the server side as well [10:05:44] I am lost really :( [10:06:17] I am not super expert in this particular case but I suspect that whatever is listed as TIME_WAIT is not recycled for some reason [10:06:43] otherwise if I got it correctly the socket should stay in TIME_WAIT only for a couple of second [10:06:46] *seconds [10:07:14] that option helps but it is probably not the silver bullet [10:07:29] the bigger issue in my opinion is that we lack proper connection pooling to Redis [10:08:13] supposedly that is twice a value of 60 seconds [10:08:21] or 2 minutes locked in TIME_WAIT stte [10:08:38] but I guess recycle actually reuse that [10:09:49] I am off for lunch. [10:14:13] will leave these settings for a bit and then rollback [10:14:20] I don't see a massive change [10:29:43] !log rollback systctl settings on mw1306 after experiment (stop jobchron/runner, stop hhvm, restore systctl settings, restart hhvm and job* daemons) [10:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] didn't have to stop hhvm at all, only job* daemons [10:34:16] will keep an eye on mw1306 [10:35:21] I really hoped for a better result :/ [10:42:27] 06Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3182571 (10Yury_Bulka) Is it likely to have this implemented by September? [10:49:27] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3182578 (10elukey) So on every mw host we set `net.ipv4.tcp_tw_reuse=1`, this is proba... [11:04:11] (03PS1) 10Elukey: Fix Zookeeper's alarm for heap usage [puppet] - 10https://gerrit.wikimedia.org/r/348206 (https://phabricator.wikimedia.org/T157968) [11:13:54] (03CR) 10Elukey: [C: 032] Fix Zookeeper's alarm for heap usage [puppet] - 10https://gerrit.wikimedia.org/r/348206 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [11:16:32] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3182620 (10ayounsi) Here are the changes I suggest to push to get this going, a 1:1 copy of the setup in eqiad (except IPs). > ayounsi@mr1-codfw# commit check > configuration check succeeds Paste... [11:23:31] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3182655 (10ayounsi) I suggest we change this alert to only triggers on core/transit/peering/cust links and add a more global alert for all the other kind of links... 
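The numbers being juggled above (sockets parked in TIME_WAIT for roughly 60 seconds, the widened local port range, and the conntrack table fill level) can all be read directly from /proc without extra tooling. A minimal sketch follows, assuming a standard Linux host such as a jobrunner; the conntrack files only exist while the nf_conntrack module is loaded.

```python
# Sketch (not part of the original log): read the values compared above
# straight from /proc. State code 06 in /proc/net/tcp* is TIME_WAIT.
from pathlib import Path

TIME_WAIT = "06"  # TCP socket state code used in /proc/net/tcp and tcp6


def count_time_wait() -> int:
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        lines = Path(name).read_text().splitlines()[1:]  # skip the header row
        total += sum(1 for line in lines if line.split()[3] == TIME_WAIT)
    return total


def read_proc(path: str) -> str:
    return Path(path).read_text().strip()


if __name__ == "__main__":
    lo, hi = read_proc("/proc/sys/net/ipv4/ip_local_port_range").split()
    print("TIME_WAIT sockets :", count_time_wait())
    print("local port range  :", lo, "-", hi, f"({int(hi) - int(lo)} ports)")
    print("conntrack usage   :",
          read_proc("/proc/sys/net/netfilter/nf_conntrack_count"), "/",
          read_proc("/proc/sys/net/netfilter/nf_conntrack_max"))
```

Run periodically during an experiment like the one above, this gives the same picture as `ss -s` / `netstat -n -o` plus the conntrack gauge, and makes it easy to see that widening the port range mostly spreads the TIME_WAIT sockets rather than eliminating them.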
[11:25:45] elukey: so widening the local port range just allowed for the connections to spread on a larger ranger [11:25:56] leading to X * more connections in TIME_WAIT [11:27:05] and a raise of conntrack [11:28:47] hashar: yep, but that's was expected [11:29:22] and with netstat -n -o (which shows the time spent in a state) [11:29:31] the TIME_WAIT seems to idle for 60 secs [11:29:59] ah nice I didn't know that option [11:30:25] neither did I until I opened the man page :-} [11:31:35] this is kinda supporting my point that tw_reuse is not applicable to all the TIME_WAIT sockets [11:31:47] it helps but it is not the silver bullet [11:34:22] and netstat -n --statistics has a bunch of tcp related stats [12:34:46] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:46:10] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182747 (10elukey) Multiple PEBKACs from my side: 1) I acked permanently the alarms in Icinga without realizing i... [12:59:12] (03PS1) 10Elukey: Fix thresholds for Zookeeper Heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/348214 (https://phabricator.wikimedia.org/T157968) [13:00:47] (03CR) 10Elukey: [C: 032] Fix thresholds for Zookeeper Heap usage alarms [puppet] - 10https://gerrit.wikimedia.org/r/348214 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [13:03:46] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:04:37] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:09:31] (03CR) 10MarkTraceur: [C: 031] Full path to xvfb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348199 (owner: 10Matthias Mullie) [13:31:32] (I'm not sure here is proper place ;) )Could you please help me rename works? Accroding to GRP, to rename account with more than 50,000 edits should not be performed without supervision of a sysadmin. [13:32:54] Sounds about right [13:32:58] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:34:26] Oh, it's good luck. [13:35:18] What should I do for it? [13:40:01] Sotiale: I think the usual reason is to make sure people are around incase something goes wrong (or it causes big problems on one of the wikis) [13:43:58] (03PS3) 10Jgreen: exim/fundraising: barium -> civi1001, donate mails to civicrm [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [13:45:37] (03CR) 10Jgreen: [C: 032] exim/fundraising: barium -> civi1001, donate mails to civicrm [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [13:46:00] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3182893 (10elukey) 05Open>03Resolved @Dzahn thanks a lot for the heads up, I should have fixed the issues. My... [13:47:56] @Reedy: I already notified information about expected problems(ex. account login errors, etc..). As a result, He/She understood it and He/She said he/she would not object to any problem related to the problems. 
[13:48:13] Well, it's more if you cause database replication problems [13:48:18] Which has been know to happen in the past [13:48:31] I don't know how much better this is now as to how it was previously [13:53:16] from memory, legoktm recommends filing a task in phabricator and will look at the request [13:53:38] there are some that due to the number of edits can't be preformed safely [14:22:27] (03Abandoned) 10Hashar: Revert "ldap: Add warning to ldaplist" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [14:23:00] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#3182966 (10elukey) [14:26:58] (03Draft33) 10Hashar: (WIP) Crazy rspec for the role module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307425 [14:27:27] (03Abandoned) 10Hashar: (WIP) Crazy rspec for the role module (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/307425 (owner: 10Hashar) [14:32:10] (03Abandoned) 10Hashar: check_graphite anomaly option to set minimum upper band [puppet] - 10https://gerrit.wikimedia.org/r/338095 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [14:33:17] (03Abandoned) 10Hashar: mediawiki-firejail: lint python scripts [puppet] - 10https://gerrit.wikimedia.org/r/338978 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:33:20] (03Abandoned) 10Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:33:43] (03Abandoned) 10Hashar: mediawiki-firejail: quiet firejail [puppet] - 10https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [14:47:38] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3182213 (10Marostegui) Works for me [14:51:57] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3183025 (10elukey) [14:56:06] 06Operations, 13Patch-For-Review, 07Wikimedia-log-errors: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3183050 (10hashar) I have found a straightforward case: MediaWiki invokes `convert --version` bu... [14:59:13] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.56 seconds [15:02:13] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [15:47:43] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:33] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:54:37] 06Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3183349 (10Ragesoss) For a wikimedia.org link, very unlikely. Getting all of the requirements in place to move this to WMF production servers is... [17:07:06] !log deployed phabricator hotfix for T162943 [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:16] T162943 [17:09:30] bd808: is task detection link thing broken? [17:10:13] oh nevermind its a restricted task [17:13:24] doh! 
https://twitter.com/usrbingrump/status/852173471949914112 [17:15:08] twentyafterfour: when in doubt revert to the commit called "Initalise Repo" that commit always seems to work for some reason :P [17:37:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [17:41:17] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3183419 (10Halfak) [17:41:58] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3183439 (10Dzahn) @elukey thank you for fixing :) They all look green now. I'll comment if i see them again. [17:42:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 2.678 second response time [17:44:05] 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (10RobH) [17:44:36] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) [17:45:06] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [17:46:44] (03CR) 10Dzahn: [C: 04-1] "i think the best monitoring is probably if we add a second check, leave the current one as it is, but add a second one to specifically jus" [puppet] - 10https://gerrit.wikimedia.org/r/348165 (owner: 10Paladox) [17:48:06] !log mw1297 - restarted hhvm and apache [17:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:03] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:54:53] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:56:34] (03PS3) 10Paladox: Phabricator: Update nrpe command for checking if phd is running [puppet] - 10https://gerrit.wikimedia.org/r/348165 [17:57:28] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [17:57:43] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) [18:07:43] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:53] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:33] PROBLEM - zotero on sca1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:43] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [18:08:43] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:43] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:10:23] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.007 second response time [18:10:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:10:33] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [18:10:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:01:42] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3183666 (10jcrespo) 05Open>03stalled Most likely a one-time error that got cached for some time? Tendril db tends to fail quite regularly due to large queries asking for large reports (but that is mostly ok). W... [19:22:29] (03PS1) 10Hashar: swift: make rewrite_thumb_server optional [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:30:02] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3141925 (10RobH) [19:35:16] Dereckson: are you around? Trwiki is having some abusefilter issues agaiin [19:35:56] ^^ or if anyone else is around to assist with that i would be grateful [19:36:09] (03PS2) 10Hashar: swift: feature flag the proxy rewriting [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:39:53] (03PS3) 10Hashar: swift: feature flag the proxy rewriting [puppet] - 10https://gerrit.wikimedia.org/r/348236 [19:40:13] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:45] (03PS1) 10Aklapper: Phabricator monthly email: Also include Differential user activity [puppet] - 10https://gerrit.wikimedia.org/r/348238 [19:44:48] (03CR) 10Aklapper: "Probably needs additional permissions to allow the script to access the phabricator_differential DB (which the script did not access befor" [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [19:53:13] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [19:53:47] jouncebot: now [19:53:47] No deployments scheduled for the next 229 hour(s) and 6 minute(s) [19:54:13] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2970804 keys, up 22 days 3 hours - replication_delay is 0 [19:54:30] i know theres no deployments today but is there a chance i could have something deployed its a mediawiki/config change... https://gerrit.wikimedia.org/r/#/c/347807/ [19:56:34] it's not a mediawiki config change [19:56:38] it's a localisation change [19:57:32] It also doesn't look like it's anything actually broken, it's an enhancement [20:00:56] thats my fault i misread the repo [20:01:05] thats what happens when you look at mutiple gerrit changes [20:01:07] at once [20:02:43] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:03:17] Zppix: we usually dont deploy anything on friday [20:03:39] and next week is on deployment freeze because the service is going to be switched from a datacenter to another [20:03:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:03:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:04:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:05:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [20:05:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:07:43] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:07:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:08:13] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:09:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:10:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:11:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:11:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:11:43] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:12:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:14:32] (03PS1) 10Reedy: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) [20:14:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:14:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:15:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:27] (03CR) 10TerraCodes: [C: 031] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:23] (03CR) 10Reedy: [C: 032] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:18:49] (03CR) 10Luke081515: [C: 031] Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:18:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:18:55] uhm, to slow :o [20:19:12] *to [20:19:14] *too [20:19:32] (03Merged) 10jenkins-bot: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:19:42] (03CR) 10jenkins-bot: Grant sysop and interface-editor 'abusefilter-modify-restricted' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348239 (https://phabricator.wikimedia.org/T161960) (owner: 10Reedy) [20:19:43] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:19:43] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:19:53] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 1055468 msg (=800000 warning): ocg_render_job_queue 3044 msg (=3000 critical) [20:20:13] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 1047741 msg (=800000 warning): ocg_render_job_queue 3063 msg (=3000 critical) [20:21:00] wonder why ^^ that keeps happening [20:21:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:21:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:22:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:24:23] paladox: OCG exploded again :( [20:24:28] oh [20:24:39] thanks for explaning :) [20:25:24] https://grafana.wikimedia.org/dashboard/db/ocg?orgId=1 [20:25:28] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#2921133 (10RobH) [20:25:39] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142232 (10RobH) [20:26:11] !log reedy@tin Synchronized wmf-config/abusefilter.php: abusefilter-modify-restricted for trwiki T161960 (duration: 01m 38s) [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:19] T161960: Enable the blocking feature of AbuseFilter on trwiki - https://phabricator.wikimedia.org/T161960 [20:27:33] oh [20:27:43] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:30:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:30:43] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:33:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:33:58] 06Operations, 10DBA: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3183833 (10bd808) ``` Accept-Ranges: bytes Age: 10 Content-Encoding: gzip Content-Length: 76 Content-Type: text/html; charset=UTF-8 Date: Fri, 14 Apr 2017 20:31:37 GMT Server: Apache Strict-Transport-Security: max-... [20:34:39] PROBLEM - MariaDB Slave SQL: s4 on db1084 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:34:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:34:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:05] What's up? [20:35:16] checking [20:35:25] <_joe_> hey [20:35:28] RECOVERY - MariaDB Slave SQL: s4 on db1084 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:35:32] good evening [20:35:34] <_joe_> I'm checking mobileapps now [20:35:36] and it's already back [20:35:43] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:35:43] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:35:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:36:04] <_joe_> uhm [20:36:41] db1084 is at 34 of loadavg with 16 cores and 3k connections [20:37:21] let me check the other DB too [20:37:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:40:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:41:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:41:53] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:44] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:44:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:18] PROBLEM - MariaDB Slave SQL: s4 on db1081 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:43] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:47:08] RECOVERY - MariaDB Slave SQL: s4 on db1081 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:47:14] what is going on with that one [20:47:53] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:47:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:47:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:48:46] marostegui: see "query" [20:49:53] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:52] (03PS1) 10Reedy: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 [20:50:53] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
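For the db1084 spike discussed above (load average around 34 on a 16-core host with roughly 3k connections), first-pass triage is generic MariaDB tooling rather than anything Wikimedia-specific. A sketch of the sort of commands one might run on the host, assuming client credentials are already set up; none of this is taken from the operators' actual session:

```bash
uptime                                                   # 1/5/15-minute load averages vs. core count
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"   # how many of the ~3k connections are still open
# Who holds the connections, busiest first:
mysql -e "SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
          FROM information_schema.processlist
          GROUP BY user, client ORDER BY conns DESC LIMIT 10"
# Long-running statements that could explain the load and the flapping slave checks:
mysql -e "SELECT id, time, state, LEFT(info, 80) AS query_snip
          FROM information_schema.processlist
          WHERE command <> 'Sleep' ORDER BY time DESC LIMIT 10"
```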
[20:51:00] (03PS2) 10Reedy: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 [20:51:02] (03PS1) 10Volans: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348242 [20:51:04] (03CR) 10Reedy: [C: 032] Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:51:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:51:53] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:51:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:15] (03Merged) 10jenkins-bot: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:52:29] (03CR) 10jenkins-bot: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348241 (owner: 10Reedy) [20:52:37] (03Abandoned) 10Volans: Revert "Deploy Linter to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348242 (owner: 10Volans) [20:53:38] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Disable Linter on larger wikis T148609 (duration: 00m 41s) [20:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:46] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [20:53:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:53] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:54:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:54:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:57:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [20:59:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [21:15:48] 06Operations, 10Gerrit, 07LDAP: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#1676037 (10bd808) We may be able to do something about this trivially after completing {T161859}. As @hashar points out in the summary, today we... [21:24:59] Okay [21:25:02] Why was Linter pulled? [21:25:36] ShakespeareFan00: https://phabricator.wikimedia.org/T148609#3183893 [21:26:59] So it broke the wiki [21:27:06] Why doesn't this surprise me?
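The revert above follows the usual mediawiki-config pattern: merge the revert in Gerrit, pull it onto the deployment host, and sync the touched file. A hedged sketch of the deploy-host half (paths and the sync-file invocation are the conventional ones, not a transcript of Reedy's session):

```bash
# On the deployment host (tin), after https://gerrit.wikimedia.org/r/348241 has merged.
cd /srv/mediawiki-staging
git pull                          # fast-forward to the merged revert
scap sync-file wmf-config/InitialiseSettings.php 'Disable Linter on larger wikis T148609'
```

The "Synchronized wmf-config/InitialiseSettings.php: …" line in the log above is the SAL entry scap records for that sync.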
[21:27:08] XD [21:27:30] it was creating issues; it surely needs more investigation to understand what triggered it [21:38:13] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:04] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.726 second response time [21:43:34] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.707 second response time [21:45:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [21:49:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.791 second response time [21:51:33] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [21:52:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:41] Wonder why they are going off now [21:54:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [21:55:02] so far it's just those 3 and they flap back [21:55:04] but it's odd [21:56:20] oh yep [21:58:43] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:04:33] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.756 second response time [22:05:04] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:04] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:13] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:35] robh ^^ more it seems [22:12:41] different mw number [22:12:53] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:13:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database pawikisource: database exists on query. Default database: pawikisource. [Query snipped] [22:13:58] ... [22:15:22] Wheee, more bugs [22:16:13] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.735 second response time [22:16:28] !log created linter tables on pawikisource T148609 [22:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:36] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [22:17:26] !log created linter tables on wbwikimedia T148609 [22:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:00] Reedy any more bugs? [22:23:13] Parsoid is causing a load of spam in the logs [22:23:19] oh. [22:23:27] Is that because of linter?
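The two "!log created linter tables …" entries above record the outcome but not the method. One common way to create an extension's missing tables for a single wiki is MediaWiki's update.php via the mwscript wrapper; a DBA may instead apply the extension's SQL schema file directly, and the log does not say which approach was used here:

```bash
# Illustrative only; the actual commands behind the !log entries are not recorded.
mwscript update.php --wiki=pawikisource --quick
mwscript update.php --wiki=wbwikimedia --quick
```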
[22:30:53] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:34:03] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.005 second response time [22:39:02] ah, it's video scalers [22:39:13] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:51] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3184042 (10RobH) [22:40:03] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142249 (10RobH) [22:45:16] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029754 (10RobH) [22:45:27] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3184051 (10RobH) [22:48:03] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.014 second response time [22:49:01] !log restarting parsoid to get the disable linter change T148609 [22:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:08] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [22:58:28] !log skipping CREATE DATABASE pawikisource on dbstore2001- duplicate declaration due to multi-source [22:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:23] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [23:12:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database wbwikimedia: database exists on query. Default database: wbwikimedia. [Query snipped] [23:12:59] jynus: ^^^ [23:14:06] ha ha [23:14:39] !log skipping CREATE DATABASE wbwikimedia on dbstore2001- duplicate declaration due to multi-source [23:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:23] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [23:19:56] (03CR) 10Jcrespo: "It may have, but the previous was was scheduled long time ago?/puppet wasn't run properly? I didn't see alerts this time. However, we got " [puppet] - 10https://gerrit.wikimedia.org/r/347996 (owner: 10Jcrespo) [23:28:39] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3184078 (10jcrespo) [23:28:42] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#3184077 (10jcrespo) [23:45:30] 06Operations, 10DBA, 10Traffic: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3184091 (10jcrespo) 05stalled>03Open I assume that is a hit of an error message? Traffic: What is tendril.wikimedia.org's caching policy so that this can happen? I would expect a smaller TTL than... 
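The "skipping CREATE DATABASE … duplicate declaration due to multi-source" entries above amount to skipping one replicated event on a single named connection. On a multi-source MariaDB replica, sql_slave_skip_counter acts on the connection selected by default_master_connection, so the sequence looks roughly like this (run on dbstore2001 itself; the connection name 's3' comes from the alert, and this is a sketch of the mechanism, not the operator's exact commands):

```bash
mysql <<'SQL'
STOP SLAVE 's3';
SET @@default_master_connection = 's3';   -- scope the skip counter to the s3 stream
SET GLOBAL sql_slave_skip_counter = 1;    -- skip the duplicate CREATE DATABASE event
START SLAVE 's3';
SHOW SLAVE 's3' STATUS\G
SQL
```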
[23:50:24] (03CR) 10Krinkle: Move contribution tracking config to CommonSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad)