[00:03:20] (CR) Dzahn: "we already had several "lint:ignore:puppet_url_without_modules" in the repo. did we really have to give up on this and disable it globally" [puppet] - https://gerrit.wikimedia.org/r/243177 (https://phabricator.wikimedia.org/T87132) (owner: Andrew Bogott)
[00:04:43] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1699044 (GWicke) @ottomata, yes. One of the motivations for having a REST interface is having,... an interface.
[00:05:44] (CR) Dzahn: [C: 2] "needs a new package version and be deployed though before it's "live"" [debs/wikistats] - https://gerrit.wikimedia.org/r/242176 (owner: John F. Lewis)
[00:06:42] (PS4) Dzahn: lint: fix 'variable not enclosed' pt2 [puppet] - https://gerrit.wikimedia.org/r/242057
[00:07:11] mutante, to allow some people to get to bastions *only*?
[00:07:47] Krenair: yes
[00:08:05] yeah, that's not how it's used in the vast majority of cases
[00:08:06] it's like having flags, you can have one or more of them
[00:08:21] It's silly for one flag to not do anything without another
[00:08:28] one flag does exactly one thing
[00:08:31] it does something
[00:08:38] it creates the user on certain hosts
[00:08:44] Not if the user can't actually use it for anything
[00:09:09] All groups that need bastion access to be able to do anything should imply bastion access.
[00:09:39] all this just because somebody forgot to add a group which took seconds to fix?
[00:09:59] mutante there is no group outside of traceback roots for which that is true. everyone else needs bastion access
[00:10:02] Because people are continually forgetting to add the right group, it's a recurring issue.
[00:10:18] it is just optimizing the process to remove manual steps
[00:10:28] for the price of having the problem next time you have a traceback-root users
[00:10:37] errr
[00:10:39] yagni
[00:10:41] I thought ops were trying to firewall everything at the moment?
[00:10:53] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: puppet fail
[00:11:07] mutante also i talked to paravoid, is OK to give traceback-roots bastion access
[00:11:49] mutante and yes if the 'price' of a simpler system is that complex non standard things are complex that seems like a worthy price to pay.
[00:11:50] If you complete that, I'm pretty sure only bastions will be able to SSH in to any other hosts.
[00:11:55] it's not simpler to me
[00:11:56] And then everyone will need bastion access.
[00:11:58] yeah that too
[00:12:05] simple is that one group does one thing
[00:12:09] sigh
[00:12:12] I give up
[00:12:15] and don't care anymore
[00:12:18] No it's not mutante.
[00:12:41] yuvipanda: i did not even vote, do what you like
[00:13:45] actually the opposite, i had removed myself from the patch to not be blocking anyone
[00:13:52] so go ahead, Krenair too
[00:15:50] mutante: ok thanks.
[00:20:24] happy firday ya'll!
[00:21:12] operations, Patch-For-Review, domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1699053 (Dzahn) a:Slaporte>Dzahn
[00:22:34] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[00:25:38] operations, WMF-Legal, domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1021828 (Dzahn)
[00:34:20] operations, Math, Patch-For-Review: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1699071 (Dzahn) @Physikerwelt ok, thanks! so trying to make this simpler again. is your request still to add the package "texlive-extra-utils" on appservers just like the ticket...
[00:37:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:45:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:34:59] operations, RESTBase, Monitoring, Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1699162 (GWicke) Open>Resolved Closing as resolved, [considering our fairly good alert coverage at this point](https:...
[01:54:30] operations, Labs, Labs-Infrastructure, labs-sprint-117: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1699192 (Andrew) Ah, mdns is a new service which I haven't thought much about. Thanks for fixing in the short-term, I'll work on a better solutio...
[01:59:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[02:06:04] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:11:38] (PS1) Aaron Schulz: Set page purge limiting [mediawiki-config] - https://gerrit.wikimedia.org/r/243363
[02:17:19] (PS1) John Vandenberg: Replace Bugzilla with Phabricator [dumps] (ariel) - https://gerrit.wikimedia.org/r/243364
[02:28:51] (CR) John Vandenberg: Extend maximum allowed mediawiki version to 1.26 (3 comments) [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661) (owner: Wpmirrordev)
[02:31:48] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 22s)
[02:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:33] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-03 02:36:33+00:00
[02:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:37] (CR) John Vandenberg: [C: -1] "this change does not add any support for 1.23; it only says it does." [dumps] (ariel) - https://gerrit.wikimedia.org/r/111728 (owner: Wpmirrordev)
[03:10:18] (CR) John Vandenberg: [C: -1] "this changeset is just fixing up Iaf2702b868 , which is a pretty good changeset, albeit limited to only adding 1.23 support. But that wou" [dumps] (ariel) - https://gerrit.wikimedia.org/r/113124 (owner: Wpmirrordev)
[03:11:04] !log shutting down graphite-web for brief sqlite database schema update
[03:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:11:14] (CR) John Vandenberg: Extend maximum allowed mediawiki version to 1.23 (2 comments) [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (owner: Wpmirrordev)
[03:12:10] !log done; graphite-web back up; url shortening will now work.
[03:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:45] (CR) Ori.livneh: "Django ships with support for such an authentication scheme, btw: https://docs.djangoproject.com/en/1.8/howto/auth-remote-user/" [puppet] - https://gerrit.wikimedia.org/r/241578 (owner: Gergő Tisza)
[03:57:01] (PS1) John Vandenberg: Skip LiquidThread namespaces [dumps] (ariel) - https://gerrit.wikimedia.org/r/243365
[03:57:20] (CR) John Vandenberg: Extend maximum allowed mediawiki version to 1.26 (1 comment) [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661) (owner: Wpmirrordev)
[03:57:44] (CR) John Vandenberg: Extend maximum allowed mediawiki version to 1.26 (1 comment) [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 (https://phabricator.wikimedia.org/T68661) (owner: Wpmirrordev)
[04:03:54] (PS1) Ori.livneh: Tell graphite-web's apache to set REMOTE_USER to LDAP user's uid [puppet] - https://gerrit.wikimedia.org/r/243366
[04:04:07] (PS2) Ori.livneh: Tell graphite-web's apache to set REMOTE_USER to LDAP user's uid [puppet] - https://gerrit.wikimedia.org/r/243366
[04:04:14] (CR) Ori.livneh: [C: 2 V: 2] Tell graphite-web's apache to set REMOTE_USER to LDAP user's uid [puppet] - https://gerrit.wikimedia.org/r/243366 (owner: Ori.livneh)
[04:08:49] (PS1) Ori.livneh: graphite-web: enable Django's REMOTE_USER auth middleware [puppet] - https://gerrit.wikimedia.org/r/243367
[04:09:15] (CR) Ori.livneh: [C: 2 V: 2] graphite-web: enable Django's REMOTE_USER auth middleware [puppet] - https://gerrit.wikimedia.org/r/243367 (owner: Ori.livneh)
[05:07:44] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail
[05:23:44] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[05:28:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 3 05:28:54 UTC 2015 (duration 28m 53s)
[05:32:10] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:36:05] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:47:43] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:49:13] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.076 second response time
[05:56:14] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:07:25] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:21:00] huh
[06:29:14] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: puppet fail
[06:54:58] (PS1) ArielGlenn: pylint for monitor.py [dumps] (ariel) - https://gerrit.wikimedia.org/r/243368
[06:56:47] (CR) ArielGlenn: [C: 2 V: 2] pylint for monitor.py [dumps] (ariel) - https://gerrit.wikimedia.org/r/243368 (owner: ArielGlenn)
[07:02:55] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:04] PROBLEM - puppet last run on mw2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:11:42] (PS1) ArielGlenn: indents to spaces for deployment files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243369
[07:13:47] (CR) ArielGlenn: [C: 2 V: 2] indents to spaces for deployment files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243369 (owner: ArielGlenn)
[07:23:27] operations, Adminbot, Labs, Tool-Labs: upgrade mwclient (morebots no more log because of MediaWiki semantic versionning) - https://phabricator.wikimedia.org/T114365#1699308 (Steinsplitter) means that mwclient is broken right now on labs?
[07:32:14] RECOVERY - puppet last run on mw2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:43:45] (PS1) ArielGlenn: WikiDump pylint: internal camelcase, indentation, spaces, line too long [dumps] (ariel) - https://gerrit.wikimedia.org/r/243371
[07:45:57] (CR) ArielGlenn: [C: 2 V: 2] WikiDump pylint: internal camelcase, indentation, spaces, line too long [dumps] (ariel) - https://gerrit.wikimedia.org/r/243371 (owner: ArielGlenn)
[07:50:48] operations, Adminbot, Labs, Tool-Labs: upgrade mwclient (morebots no more log because of MediaWiki semantic versionning) - https://phabricator.wikimedia.org/T114365#1699312 (zhuyifei1999) >>! In T114365#1699308, @Steinsplitter wrote: > means that mwclient is broken right now on labs? @Andrew fixed it
[08:05:04] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:18:24] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:20:47] operations, Analytics, Services, Monitoring: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1699340 (jcrespo) NEW
[08:31:09] (PS1) Jcrespo: Changing tmpdir from /tmp to /srv/labsdb/tmp [puppet] - https://gerrit.wikimedia.org/r/243372
[08:31:51] (CR) jenkins-bot: [V: -1] Changing tmpdir from /tmp to /srv/labsdb/tmp [puppet] - https://gerrit.wikimedia.org/r/243372 (owner: Jcrespo)
[08:41:13] (PS1) Jcrespo: Add automatic buffer pool dumping for tools [puppet] - https://gerrit.wikimedia.org/r/243373
[08:42:34] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:44:04] PROBLEM - Parsoid on wtp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:44:13] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 6.033 second response time
[08:44:34] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:45:35] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.171 second response time
[08:46:04] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.006 second response time
[08:47:05] (PS2) Jcrespo: Changing tmpdir from /tmp to /srv/labsdb/tmp [puppet] - https://gerrit.wikimedia.org/r/243372
[08:49:23] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:49:34] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:50:46] (PS1) ArielGlenn: some flake8 for worker.py and related files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243374
[08:50:54] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 8.731 second response time
[08:51:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[08:52:53] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 8.361 second response time
[08:57:27] (CR) ArielGlenn: [C: 2 V: 2] some flake8 for worker.py and related files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243374 (owner: ArielGlenn)
[08:57:34] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:58:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:59:23] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 7.986 second response time
[09:07:51] (CR) John Vandenberg: [C: -1] "This only provides v1.23 support; e.g. it doesnt provide page.page_lang support." (1 comment) [dumps] (ariel) - https://gerrit.wikimedia.org/r/139413 (https://bugzilla.wikimedia.org/66663) (owner: Wpmirrordev)
[09:11:24] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:11:41] <_joe_> what's happening to parsoid?
[09:14:25] <_joe_> memory exhaustion I'd say
[09:14:35] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.044 second response time
[09:14:35] <_joe_> !log restarting parsoid on wtp1021
[09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:15:30] <_joe_> nope, more like infinite loops
[09:16:29] <_joe_> so ofc restarting parsoid heals it
[09:21:46] (PS2) John Vandenberg: Support MediaWiki version to 1.23 [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (owner: Wpmirrordev)
[09:24:07] (PS3) John Vandenberg: Support MediaWiki version 1.23 [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (owner: Wpmirrordev)
[09:29:26] (CR) John Vandenberg: "I've merged this into Iaf2702b8680d ; this changeset can be abandoned" [dumps] (ariel) - https://gerrit.wikimedia.org/r/113124 (owner: Wpmirrordev)
[09:30:23] (CR) John Vandenberg: "I've merged this into Iaf2702b8680d ; this changeset can be abandoned" [dumps] (ariel) - https://gerrit.wikimedia.org/r/111728 (owner: Wpmirrordev)
[09:33:01] (PS4) John Vandenberg: Support MediaWiki version 1.23 [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (https://phabricator.wikimedia.org/T68661) (owner: Wpmirrordev)
[09:33:40] (PS2) John Vandenberg: Skip LiquidThread namespaces [dumps] (ariel) - https://gerrit.wikimedia.org/r/243365 (https://phabricator.wikimedia.org/T68661)
[09:34:23] (PS5) John Vandenberg: Support MediaWiki version 1.23 [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (https://phabricator.wikimedia.org/T68663) (owner: Wpmirrordev)
[09:36:50] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1699371 (Joe) NEW
[09:38:12] <_joe_> !log rolling restarting all parsoids in eqiad
[09:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:44:07] (CR) John Vandenberg: "Alternatively I could fix up I8993f3a6058b4, but I would kinda prefer working this smaller changeset through code review first, and then t" [dumps] (ariel) - https://gerrit.wikimedia.org/r/113103 (https://phabricator.wikimedia.org/T68663) (owner: Wpmirrordev)
[10:25:44] operations, Math, Patch-For-Review: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#1699410 (Physikerwelt) @Dzahn: That was not my request and Im not in favour of that. We, @gwicke, @mobrovac, and others, are making good progress with the new rendering mode das...
[11:17:44] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:28:00] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1699440 (ssastry) Weird. Happened to be up early before heading out for the day (Saturday), but results of some quick investigation. Looking at https://logstash.wikimedia.org/#/da...
[11:28:24] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1699441 (ssastry) But, thanks Joe for the initial investigation and the restarts.
[11:44:25] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:47:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000.0]
[11:48:47] (CR) Hashar: [C: 1] lint: fix 'variable not enclosed' pt2 [puppet] - https://gerrit.wikimedia.org/r/242057 (owner: Dzahn)
[12:26:44] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000.0]
[12:31:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000.0]
[13:13:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[13:17:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[13:24:48] I think they are only snapshot hosts, no production affection
[13:27:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000.0]
[13:32:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[13:42:44] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000.0]
[13:47:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[13:52:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[13:56:51] (PS1) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[13:57:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000.0]
[13:58:59] operations, Dumps-Generation, MediaWiki-extensions-Maintenance: snapshot1004 is complaining about connection errors to localhost:11212 (memcache/nutcracker) - https://phabricator.wikimedia.org/T114571#1699600 (jcrespo) NEW
[14:01:39] so I cannot "depool" it
[14:02:07] I prefer to leave the dump running, unless it starts causing problems somewhere else
[14:02:35] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000.0]
[14:04:30] operations, Dumps-Generation: snapshot1004 is complaining about connection errors to localhost:11212 (memcache/nutcracker) - https://phabricator.wikimedia.org/T114571#1699618 (Krenair)
[14:04:51] (PS1) Hoo man: Include nutcracker in snapshot::packages [puppet] - https://gerrit.wikimedia.org/r/243396 (https://phabricator.wikimedia.org/T114571)
[14:04:58] jynus: ^
[14:05:48] ha, you are more updated about mediawiki changed that I am
[14:07:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[14:08:47] will the defaults work?
[14:09:43] mediawiki::nutcracker itself doesn't take any parameters
[14:09:56] I guess it will
[14:10:27] role::mediawiki:common has two additional ferm rules
[14:10:30] yes, sorry, I got confused with the movement
[14:10:34] which exempt it from connection tracking
[14:11:19] But it should work without that
[14:11:44] let's try
[14:11:51] I think they don't even have base::firewall
[14:11:59] at least I can't see where that would be added
[14:12:20] (CR) Jcrespo: [C: 2] Include nutcracker in snapshot::packages [puppet] - https://gerrit.wikimedia.org/r/243396 (https://phabricator.wikimedia.org/T114571) (owner: Hoo man)
[14:14:27] (PS2) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[14:15:54] I think these hosts were upgraded recently, so they may not be at 100% tuning
[14:16:37] (PS3) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[14:16:38] Yeah, and they're quite special
[14:16:58] now we have an answer for netstat -tnl | grep 11212
[14:17:19] Logs look good for snapshoot1004 as well as well
[14:17:25] let's see what the logs have to say about that
[14:17:34] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0]
[14:18:20] :)
[14:18:37] thank you, I am completelly disconnected from some parts of our infrastructure
[14:19:20] those role movements are dangerous indeed
[14:19:46] You don't notice such breakage until you set up a new node
[14:21:17] yes, also that
[14:21:24] thanks to puppet
[14:21:54] it is not a binary state, package => ensure installed / removed
[14:22:02] operations, Dumps-Generation: snapshot1004 is complaining about connection errors to localhost:11212 (memcache/nutcracker) - https://phabricator.wikimedia.org/T114571#1699660 (hoo) Open>Resolved a:hoo
[14:22:14] it has a hidden state which his "undetermined"
[14:22:15] (PS4) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[14:22:55] operations, Dumps-Generation: snapshot1004 is complaining about connection errors to localhost:11212 (memcache/nutcracker) - https://phabricator.wikimedia.org/T114571#1699663 (jcrespo) Apparently, snapshots hosts do indeed require nutcracker, but it was not installed due to https://gerrit.wikimedia.org/r/...
[14:23:15] (I like to add a comment of why, etc, as a documentation)
[14:23:39] for silly people like myself
[14:26:01] (PS2) Jcrespo: Add automatic buffer pool dumping for tools [puppet] - https://gerrit.wikimedia.org/r/243373
[14:27:48] (CR) Jcrespo: [C: 2] Add automatic buffer pool dumping for tools [puppet] - https://gerrit.wikimedia.org/r/243373 (owner: Jcrespo)
[14:28:20] (PS3) Jcrespo: Changing tmpdir from /tmp to /srv/labsdb/tmp [puppet] - https://gerrit.wikimedia.org/r/243372
[14:30:28] (CR) Jcrespo: [C: 2] Changing tmpdir from /tmp to /srv/labsdb/tmp [puppet] - https://gerrit.wikimedia.org/r/243372 (owner: Jcrespo)
[14:34:03] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:33] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 35.00 ms
[14:48:11] (PS1) Cscott: Update cxserver Parsoid configuration. [puppet] - https://gerrit.wikimedia.org/r/243400
[14:51:21] (CR) Cscott: "Note that the parsoid configuration is currently unused in production." [puppet] - https://gerrit.wikimedia.org/r/243400 (owner: Cscott)
[14:56:36] operations, Traffic, WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1699684 (MZMcBride) Does it cost us money to keep domains such as `wikipedia.lol` and `wikimedia.xyz` registered? If we're spending donor money on these f...
[14:57:18] wikipedia.wtf
[14:58:07] (PS5) BBlack: varnish: misspass limiter [puppet] - https://gerrit.wikimedia.org/r/241643
[14:59:56] (PS5) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[15:04:54] (PS6) BBlack: varnish: misspass limiter [puppet] - https://gerrit.wikimedia.org/r/241643
[15:04:56] (PS6) BBlack: move netmapper processing to common VCL [puppet] - https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177)
[15:04:58] (PS1) BBlack: remove last vcl_config fe default [puppet] - https://gerrit.wikimedia.org/r/243401
[15:07:50] (PS2) BBlack: remove last vcl_config fe default [puppet] - https://gerrit.wikimedia.org/r/243401
[15:07:52] (PS7) BBlack: varnish: misspass limiter [puppet] - https://gerrit.wikimedia.org/r/241643
[15:09:17] (CR) BBlack: varnish: misspass limiter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/241643 (owner: BBlack)
[15:17:08] operations, Traffic, WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1699698 (BBlack) @MZMcBridge - I'm assuming for now that for at least some of the seemingly-superfluous ones, we have some legal reason we want to own them.
[15:24:11] Who's MZMcBridge? :)
[15:26:09] my fingers have developed autocomplete apparently, including all the usual bugs :P
[15:33:53] !log stopping temporarily labsdb1004 mariadb to complete clone process
[15:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:37:48] (CR) Andrew Bogott: "I'm in favor of fixing those -- I just wanted to turn on the voting tests as soon as possible before there was backsliding elsewhere." [puppet] - https://gerrit.wikimedia.org/r/243177 (https://phabricator.wikimedia.org/T87132) (owner: Andrew Bogott)
[15:40:37] It's a fairly common typo, my client actually is set to match on [m(c|z)]?mcbrid(g)?e -regexp :P
[16:10:16] (PS1) BBlack: Move all X-Analytics code to common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[16:13:59] (PS2) Alex Monk: Move all X-Analytics code to common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[16:20:23] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[16:23:18] (PS1) Jcrespo: Smallest change needed to unbreak nagios config [puppet] - https://gerrit.wikimedia.org/r/243408 (https://phabricator.wikimedia.org/T114556)
[16:24:45] (PS2) Jcrespo: Smallest change needed to unbreak nagios config [puppet] - https://gerrit.wikimedia.org/r/243408 (https://phabricator.wikimedia.org/T114556)
[16:26:12] please vote +1 if you want to unbreak icinga: https://gerrit.wikimedia.org/r/#/c/243408/
[16:31:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[16:34:33] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[16:39:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:39:42] (Restored) MarcoAurelio: Enable Education Program extension at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: MarcoAurelio)
[16:39:47] (PS3) MarcoAurelio: Enable Education Program extension at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619)
[16:40:09] (CR) MarcoAurelio: Enable Education Program extension at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: MarcoAurelio)
[16:41:33] (PS1) ArielGlenn: git deploy: update for salt bug fix, pylint [puppet] - https://gerrit.wikimedia.org/r/243411
[16:41:50] operations, Analytics, Services, Monitoring, Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1699757 (jcrespo) ^This will unbreak icinga, but it may make people angry (e.g. @Milimetric), so I will not apply it without someone, op or analytics agree...
[16:44:10] operations, Analytics, Services, Icinga, and 2 others: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1699759 (jcrespo)
[16:44:22] operations, Analytics, Services, Icinga, Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1699340 (jcrespo)
[16:50:28] (PS3) BBlack: Move all X-Analytics code analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[16:51:17] (CR) jenkins-bot: [V: -1] Move all X-Analytics code analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) (owner: BBlack)
[16:53:35] (PS1) Giuseppe Lavagetto: Minor fixes to instrumentation [debs/pybal] - https://gerrit.wikimedia.org/r/243413
[16:53:38] (PS1) Giuseppe Lavagetto: Fix signal handling, some cleanup [debs/pybal] - https://gerrit.wikimedia.org/r/243414
[16:53:40] (PS1) Giuseppe Lavagetto: New package version [debs/pybal] - https://gerrit.wikimedia.org/r/243415
[16:55:55] (PS4) BBlack: Move all X-Analytics code analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[16:56:43] (PS5) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[16:58:22] (PS6) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[17:04:14] (PS7) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[17:05:55] (PS3) BBlack: remove last vcl_config fe default [puppet] - https://gerrit.wikimedia.org/r/243401
[17:07:52] (Abandoned) BBlack: HTTP/2 Alpha Patch [software/nginx] (wmf-1.9.4-1-h2) - https://gerrit.wikimedia.org/r/237646 (owner: BBlack)
[17:23:42] (PS8) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[17:31:19] operations, Analytics, Services, Icinga, Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1699808 (ArielGlenn) I wonder if the list of folks ought to be this: T114383 and T113416, pretty unclear to me though.
[18:42:53] 6operations, 10Deployment-Systems, 10Salt: service-restart or git deploy service restart does not wait between batches - https://phabricator.wikimedia.org/T114583#1699900 (10ArielGlenn) 3NEW a:3ArielGlenn [19:52:32] (03PS1) 10Ori.livneh: Revert "graphite: make compatible with Apache 2.4" [puppet] - 10https://gerrit.wikimedia.org/r/243430 [20:05:22] (03PS2) 10Ori.livneh: Revert "graphite: make compatible with Apache 2.4" [puppet] - 10https://gerrit.wikimedia.org/r/243430 [20:07:03] (03PS1) 10Ori.livneh: graphite-web: set REMOTE_USER_AUTHENTICATION = True [puppet] - 10https://gerrit.wikimedia.org/r/243432 [20:07:40] (03CR) 10Ori.livneh: [C: 032] Revert "graphite: make compatible with Apache 2.4" [puppet] - 10https://gerrit.wikimedia.org/r/243430 (owner: 10Ori.livneh) [20:08:19] (03PS2) 10Ori.livneh: graphite-web: set REMOTE_USER_AUTHENTICATION = True [puppet] - 10https://gerrit.wikimedia.org/r/243432 [20:08:38] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite-web: set REMOTE_USER_AUTHENTICATION = True [puppet] - 10https://gerrit.wikimedia.org/r/243432 (owner: 10Ori.livneh) [20:10:23] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: puppet fail [20:10:55] (03PS1) 10Ori.livneh: graphite-web: Handle boolean values correctly for `remote_user_auth` [puppet] - 10https://gerrit.wikimedia.org/r/243433 [20:10:59] (03CR) 10jenkins-bot: [V: 04-1] graphite-web: Handle boolean values correctly for `remote_user_auth` [puppet] - 10https://gerrit.wikimedia.org/r/243433 (owner: 10Ori.livneh) [20:11:07] (03PS2) 10Ori.livneh: graphite-web: Handle boolean values correctly for `remote_user_auth` [puppet] - 10https://gerrit.wikimedia.org/r/243433 [20:12:15] (03PS3) 10Ori.livneh: graphite-web: Handle boolean values correctly for `remote_user_auth` [puppet] - 10https://gerrit.wikimedia.org/r/243433 [20:12:25] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite-web: Handle boolean values correctly for `remote_user_auth` [puppet] - 10https://gerrit.wikimedia.org/r/243433 
(owner: Ori.livneh)
[20:13:34] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[20:39:32] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700006 (GWicke) From a mail thread: Latencies are still up, and p99 from RESTBase's POV is often at the configured timeout of 2 minutes (see attached graph). {F2656944} On Sat,...
[20:39:44] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700007 (GWicke) Not saying that the outage was caused by memory usage, but one thing to keep in mind that it's often hard to separate CPU usage and memory usage. Node processes ne...
[20:57:20] (PS9) BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177)
[20:57:47] (PS1) Ori.livneh: xenon-grep: add `--slice` arg; support 'all' entrypoint [puppet] - https://gerrit.wikimedia.org/r/243472
[20:58:02] (CR) Ori.livneh: [C: 2 V: 2] xenon-grep: add `--slice` arg; support 'all' entrypoint [puppet] - https://gerrit.wikimedia.org/r/243472 (owner: Ori.livneh)
[21:15:22] operations, MediaWiki-extensions-TimedMediaHandler: Investigate additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1700123 (brion) Adding operations (is that right?) -- need to update ffmpeg2theora package in our apt repo with updated patch.
[21:16:34] PROBLEM - Disk space on stat1003 is CRITICAL: DISK CRITICAL - free space: /srv 166416 MB (3% inode=95%)
[21:17:09] operations, MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1700124 (brion)
[21:21:03] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: puppet fail
[21:49:13] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[23:16:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [500.0]
[23:33:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:50:27] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700245 (ssastry) Check http://grafana.wikimedia.org/#/dashboard/db/parsoid-times-vs-doc-size ... and you will find a few different interesting tidbits: 1. The outage corresponds...