[00:00:04] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2279070 (10mmodell) As for the blocker, I don't mind if we automate it some other way, I just wanted to note that porting it to scap is on the rada... [00:00:16] AaronSchulz: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=499742&oldid=498004 [00:01:29] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2279073 (10Dzahn) @thcipriani could you specify who is "CI" and "the releng" team in this context? [00:03:49] (03CR) 10Dzahn: "not used in labs: https://tools.wmflabs.org/watroles/role/role::performance" [puppet] - 10https://gerrit.wikimedia.org/r/285313 (owner: 10Dzahn) [00:06:18] (03PS2) 10Dzahn: performance site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285313 [00:07:08] (03PS3) 10Dzahn: performance site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285313 [00:07:28] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/2711/" [puppet] - 10https://gerrit.wikimedia.org/r/285313 (owner: 10Dzahn) [00:10:10] (03CR) 10Dzahn: [V: 032] performance site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285313 (owner: 10Dzahn) [00:11:47] (03CR) 10Dzahn: "no-op on graphite1001" [puppet] - 10https://gerrit.wikimedia.org/r/285313 (owner: 10Dzahn) [00:12:26] (03PS1) 10Dzahn: role performance::site, adjust name in comment [puppet] - 10https://gerrit.wikimedia.org/r/287870 [00:12:56] (03PS3) 10Dzahn: noc site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285312 [00:13:23] (03CR) 10Dzahn: [C: 032 V: 032] role performance::site, adjust name in comment [puppet] - 10https://gerrit.wikimedia.org/r/287870 (owner: 10Dzahn) [00:14:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [00:15:54] (03PS4) 10Dzahn: noc site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285312 [00:16:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6029098 keys - replication_delay is 0 [00:23:33] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2279145 (10Peachey88) There is already a ticket in #procurement for this. I think @Dzahn mentioned on IRC at the time about making this Lets Encrypt. [00:24:46] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2279149 (10aude) ran the script on test.wikidata, test2wiki, etc. and have some backups of the tables. 
now can try running it for other wikis [00:55:44] (03PS1) 10Jforrester: Centralise feedback for the visual editor at the Hindi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287873 (https://phabricator.wikimedia.org/T134789) [00:59:24] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: puppet fail [01:02:51] (03CR) 10Dzahn: [C: 032] "no-op except motd http://puppet-compiler.wmflabs.org/2712/mw1152.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/285312 (owner: 10Dzahn) [01:09:48] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2279222 (10Dzahn) T134363 [01:11:01] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/2713/" [puppet] - 10https://gerrit.wikimedia.org/r/286165 (owner: 10Dzahn) [01:18:40] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2279268 (10aude) ran the script on all wikis in wikidataclient.dblist I can already add jamwiki site links on test.wikidata. It might take up... [01:19:23] (03CR) 10Dzahn: "where does "role snapshot::cron" that is currently used in site.pp come from? There is only role::cron::primary and role::cron::secondary" [puppet] - 10https://gerrit.wikimedia.org/r/286165 (owner: 10Dzahn) [01:27:42] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:10:00] i'm in recent days (last time few secs ago) hitting the insecure connection error (browser blocks the loading). checking the certificate, i see, that although loading cs.wikisource.org page, the certificate is for *.wikipedia.org, assuming that will be related... [02:24:02] Danny_B: the alt name on that certificate should have a lot more domains listed [02:24:18] all of the sister projects are behind a unified certificate [02:24:21] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 09m 23s) [02:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:42] yup, found that in the meantime... so there must be other cause of the issue [02:25:05] i'm hitting that in a frequency about 5 hits/day [02:25:41] in last about two weeks (maybe even more, i started to pay attention more to it very recently as it became more suspicious) [02:27:05] what should i try to find out next time the issue occurs for the records/issue tracking? [02:29:33] the fingerprint of the cert you are seeing would probably be helpful and the ip address that your browser is talking to [02:30:45] bb.lack and fa.idon would be the folks I'd recommend pinging here and on phabricator [02:33:10] 87:F5:BA:BB:D8:97:C5:79:B6:6A:F5:2F:D8:63:8B:99:BD:1C:E8:26 sha1 [02:33:11] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [02:33:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 10 02:33:17 UTC 2016 (duration 8m 57s) [02:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:43] heh. bad stashbot, no cookie [02:34:36] https://cs.wikisource.org/wiki/Speci%C3%A1ln%C3%AD:Nastaven%C3%AD [02:34:52] submitting the changed preferences hence POST [02:36:04] also i should add, i'm hitting that when connected from various providers in totally different ip ranges [02:36:56] what browser and OS? 
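Danny_B's observation above — loading cs.wikisource.org but seeing a certificate issued for *.wikipedia.org — is expected, since the unified certificate carries the sister projects as additional subjectAltName entries. A minimal sketch (Python 3 standard library only; the hostname is the one from the report) of listing those entries to confirm *.wikisource.org is covered:

    import socket
    import ssl

    hostname = "cs.wikisource.org"
    ctx = ssl.create_default_context()  # full verification, so getpeercert() returns the parsed cert
    with socket.create_connection((hostname, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()

    # subjectAltName is a tuple of ('DNS', name) pairs on the unified certificate
    sans = [name for kind, name in cert.get("subjectAltName", ()) if kind == "DNS"]
    print("\n".join(sorted(sans)))

If *.wikisource.org appears in that list, an error blaming a *.wikipedia.org-only certificate points at whatever sat between the browser and the server for that particular request, which is what the rest of this conversation digs into.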
[02:37:35] this is certainly worth opening a phab ticket to track [02:54:36] Danny_B: what is that fingerprint above from? (like, what field?) [02:54:59] PROBLEM - RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:29] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 249.23, 163.18, 75.51 [02:57:11] oh it is the fingerprint, I was looking at the wrong field :) [02:57:14] 87 F5 BA BB D8 97 C5 79 B6 6A F5 2F D8 63 8B 99 BD 1C E8 26 [02:57:14] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [02:58:03] we've had a report like this in the past, which we assumed to be related to local software IIRC [02:58:27] (as in, some kind of AV/SafeBrowsing software on the user's machine, which also proxies browser requests at least some of the time) [03:03:21] bblack: various computers, various browsers, various providers... [03:03:53] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2279414 (10yuvipanda) Hello! I was the one who suggested the root access (and still think it is necessary) so let me list out the reasons. 1. This is for imp... [03:03:54] hmm, just hit that again on cs.wikisource [03:03:55] well, we need some kind of correlation to go on [03:04:19] I'm pretty sure it's not generally the case that we're doing anything wrong on our end for basic cert errors [03:04:19] bblack: just tell me which data i should record in future [03:04:38] the whole request (the exact one that errored) if you can [03:05:06] the URL being hit, the server-side IP the traffic to that URL was, the fingerprint of the cert if claims is bad, etc [03:05:26] if the browser is even claiming that the cert is bad. there are other sorts of "insecure connection error" than that I suppose [03:05:36] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:11] the browser doesn't say anything particular [03:06:15] it shows [03:06:15] e.g. for Chrome, if you get an SSL connection error, there's an Adanced link on the page, which will tell you something like: [03:06:18] This server could not prove that it is text-lb.esams.wikimedia.org; its security certificate is from *.wikipedia.org. This may be caused by a misconfiguration or an attacker intercepting your connection. [03:07:07] https://support.cdn.mozilla.net/media/uploads/gallery/images/2016-03-27-02-50-02-0b2266.png [03:07:21] also in Chrome, in grey text without even clicking on Advanced, there will be an error indicator like: NET::ERR_CERT_COMMON_NAME_INVALID [03:07:25] RECOVERY - Swift HTTP backend on ms-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.149 second response time [03:07:30] (that's from ff help, so i don't have to do screenshot myself) [03:07:41] and in my case, when I tried to copy/paste that, the error message screen flipped to showing me a bunch of debug dump about the cert [03:07:43] it does not show any other info [03:08:14] So what does "Learn more..." say? anything else useful? 
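To capture the data bblack asks for above — the server-side IP, the fingerprint of whatever certificate was actually served, and its PEM form — without relying on browser UI, something like the following works. It is a sketch using only the Python 3 standard library; the IP address is an illustrative esams text-lb address, not necessarily the one the affected browser was talking to:

    import hashlib
    import socket
    import ssl

    server_ip = "91.198.174.192"    # example edge IP; substitute the address the browser actually hit
    sni_name = "cs.wikisource.org"  # hostname from the failing request

    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # we want to record the cert even (especially) if it is the wrong one
    ctx.verify_mode = ssl.CERT_NONE

    with socket.create_connection((server_ip, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni_name) as tls:
            der = tls.getpeercert(binary_form=True)   # leaf certificate in DER form

    fingerprint = ":".join("{:02X}".format(b) for b in hashlib.sha1(der).digest())
    print("SHA-1 fingerprint:", fingerprint)
    print(ssl.DER_cert_to_PEM_cert(der))              # PEM-encoded leaf, suitable for pasting into a ticket

Comparing that fingerprint against the one the browser reports during a failure would show whether the bad certificate is coming from Wikimedia's edge at all.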
[03:09:05] PROBLEM - puppet last run on dbstore2002 is CRITICAL: CRITICAL: puppet fail [03:09:11] there are probably a thousand things that can go wrong, we need more detail [03:09:49] if it's happening with multiple browsers, try to reproduce it with Chrome, at least it has detailed debugging [03:10:06] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:11] when you hit it with Chrome, you'll get a big grey screen with a red X lock icon that says things like: [03:10:14] Your connection is not private [03:10:17] Attackers might be trying to steal your information [03:10:35] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:11:03] and right under that will be an errors string like "NET::ERR_CERT_COMMON_NAME_INVALID", which will be clickable to pop out more information, as will the Advanced link below it [03:11:31] including (critically) "PEM encoded chain" [03:11:53] you can see all this by trying to browse to https://text-lb.esams.wikimedia.org/ (which is our servers, but that's a debugging hostname that's not valid for our certificates) [03:13:58] bblack: learn more links to mozilla help púage where i copied the sshot from... https://support.mozilla.org/en-US/kb/what-does-your-connection-is-not-secure-mean [03:14:46] bblack: atm it happens totally randomly, so i can't enforce replication :-/ [03:14:52] it says you can "Click on Advanced for more information on why the connection is not secure. Some common errors are described below:" [03:15:03] so when it happens again, click on Advanced [03:15:50] and then copy/screenshot everything Advanced gives you [03:16:24] again - there was no advanced... [03:16:33] i had the text shown on bottom sshot [03:16:34] well, the links you keep pasting have an Advanced [03:16:44] scroll down ;-) [03:16:55] you asked where "learn more" links to [03:17:05] it links to that helppage [03:17:11] well, that's useless [03:17:33] i know, blame mozilla :-/ [03:17:44] hence why i'm asking what else i can monitor [03:17:47] you did say [03:17:54] "various computers, various browsers, various providers..." [03:18:02] and randomly [03:18:03] use a better browser and capture the error again [03:18:19] if i'll hit it with other browser with reasonable message, i'll definitely copy it [03:18:23] (03PS1) 10Andrew Bogott: labs_bootstrap-vz: customize the ldap hosts according to domain. [puppet] - 10https://gerrit.wikimedia.org/r/287886 [03:18:33] but it's extremely unlikely this is anything on our end. most likely it's your browser, os, network, ISP, or some evil entity is trying to mess with things. [03:19:02] try to figure out specifically what's affected and what isn't. [03:19:15] (03PS2) 10Andrew Bogott: labs_bootstrapvz: customize the ldap hosts according to domain. [puppet] - 10https://gerrit.wikimedia.org/r/287886 [03:20:21] most likely it's your browser, os, network, ISP, --- "various computers, various browsers, various providers...", everything is windows though ;-) [03:20:24] (03PS3) 10Andrew Bogott: labs_bootstrapvz: customize the ldap hosts according to domain. 
[puppet] - 10https://gerrit.wikimedia.org/r/287886 [03:21:14] also - may it be somehow related to those european issues with european network (i don't remember exactly what it was, but it happened at least twice quite recently) [03:21:43] basically wikis were unaccessible [03:22:17] no, it's not related to that [03:22:20] ok [03:22:24] https://support.mozilla.org/en-US/questions/1084406 [03:22:28] well, let's see [03:23:27] do you use some common anti-virus or other security software on all your windows machines? [03:23:41] (03PS4) 10Andrew Bogott: labs_bootstrapvz: customize the ldap hosts according to domain. [puppet] - 10https://gerrit.wikimedia.org/r/287886 [03:25:10] (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz: customize the ldap hosts according to domain. [puppet] - 10https://gerrit.wikimedia.org/r/287886 (owner: 10Andrew Bogott) [03:25:13] i don't have other machines handy atm, but yes, on mine i have avast. i'll play with it according to what is written on that page. thanks for research [03:25:46] yeah with "security" software the situation is nightmarish [03:26:36] anything installed on your machine to do security has full privileges to act on your behalf. some of them go so far as to (a) proxy all your HTTPS traffic through their software and/or (b) install an alernate root certificate of their own, allowing their software to fake our certificates to the browser. [03:26:54] basically compromising security and causing bugs, in the name of security. you really have to be able to trust whoever you install software from :/ [03:27:02] hmm, but i don't have "scan https" turned on [03:27:26] that it even has such a button means it's probably capable of doing those kinds of things and has code and hooks for it [03:27:49] I wouldn't know tbh, I haven't used any windows platform for anything in well over a decade, sorry [03:28:24] I only know what other people report or I can find in google, as far as windows-specific issues go [03:28:26] no prob, now i know that it may be connected to av [03:28:37] i'll see if other machines have it as well [03:28:54] and see what is the settings [03:29:03] i'll play with it [03:29:09] ok [03:29:10] thanks for all help [03:35:12] RECOVERY - puppet last run on dbstore2002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:38:01] PROBLEM - NTP on ms-be2016 is CRITICAL: NTP CRITICAL: No response from NTP server [03:52:23] (03PS1) 10Yuvipanda: aptly: Add ferm rule to role [puppet] - 10https://gerrit.wikimedia.org/r/287887 [03:52:52] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Add ferm rule to role [puppet] - 10https://gerrit.wikimedia.org/r/287887 (owner: 10Yuvipanda) [04:01:22] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [04:02:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [04:02:03] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [04:03:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:09:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:10:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:16:41] (03PS2) 10Yuvipanda: ldap: Get rid of cleanup-pam-config script [puppet] - 10https://gerrit.wikimedia.org/r/287660 [04:18:03] (03CR) 10Yuvipanda: [C: 032] ldap: Get rid 
of cleanup-pam-config script [puppet] - 10https://gerrit.wikimedia.org/r/287660 (owner: 10Yuvipanda) [04:18:21] (03PS2) 10Yuvipanda: ldap: Fix another arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/287661 [04:18:33] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Fix another arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/287661 (owner: 10Yuvipanda) [04:18:54] (03PS2) 10Yuvipanda: ldap: Remove some more ensure => absents no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/287662 [04:19:02] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove some more ensure => absents no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/287662 (owner: 10Yuvipanda) [04:26:35] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [04:27:23] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [04:33:23] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80204 MB (15% inode=99%) [04:47:03] RECOVERY - Disk space on elastic1022 is OK: DISK OK [04:49:33] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [04:51:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:52:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [04:56:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:59:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:07:18] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:19] PROBLEM - salt-minion processes on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:28] PROBLEM - swift-object-updater on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:38] PROBLEM - swift-account-reaper on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:47] PROBLEM - DPKG on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:48] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:49] PROBLEM - swift-object-replicator on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:57] PROBLEM - swift-account-replicator on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:08] PROBLEM - swift-account-server on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:28] PROBLEM - swift-container-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:37] PROBLEM - Check size of conntrack table on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:38] PROBLEM - swift-container-server on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:47] PROBLEM - configured eth on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:47] PROBLEM - swift-container-replicator on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:08:48] PROBLEM - dhclient process on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:08:58] PROBLEM - swift-object-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:09:08] PROBLEM - swift-object-server on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:15:18] PROBLEM - Disk space on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:15:48] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [05:23:08] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:31:18] PROBLEM - Disk space on elastic1031 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79834 MB (15% inode=99%) [05:37:22] PROBLEM - Disk space on elastic1031 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80543 MB (15% inode=99%) [05:50:42] RECOVERY - Disk space on elastic1031 is OK: DISK OK [06:10:04] (03PS1) 10Yuvipanda: tools: 'backup' packages [puppet] - 10https://gerrit.wikimedia.org/r/287900 [06:15:23] (03PS2) 10Yuvipanda: tools: 'backup' packages [puppet] - 10https://gerrit.wikimedia.org/r/287900 [06:15:56] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: 'backup' packages [puppet] - 10https://gerrit.wikimedia.org/r/287900 (owner: 10Yuvipanda) [06:23:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:24:33] !log restbase deploy start of 1c890c4 [06:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:52] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:41] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] !log restbase deploy end of 1c890c4 [06:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:35:02] for rb2006 that's a false alarm ^ [06:35:09] (the puppet failure) [06:35:42] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:07] 06Operations, 10DBA, 07Performance, 07RfC, 05codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2279579 (10jcrespo) Writes generate errors at 1000-10000 (rate per second), I am concerned about the logging infrastructure, specifically thi... [06:48:42] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is WARNING: Test retrieve lead section of en.wp main page via mobile-sections-lead responds with unexpected body: /description = Main page of Wikimedia projects: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICA [06:52:01] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:58] really? 
[06:56:22] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:33] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:43] (03PS1) 10Mobrovac: service::node: Add a convenience script to pretty-tail logs [puppet] - 10https://gerrit.wikimedia.org/r/287902 [07:22:29] (03CR) 10Mobrovac: "Looking good for the compiler - https://puppet-compiler.wmflabs.org/2715/" [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [07:24:25] (03PS4) 10Mobrovac: Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 [07:25:10] (03CR) 10Mobrovac: Change prop: Add the rule for MobileApps re-renders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286847 (owner: 10Mobrovac) [07:26:03] akosiaris: Can you review https://gerrit.wikimedia.org/r/#/c/286395/ ? [07:26:25] akosiaris: also when possibaly we can deploy it. [07:49:14] 06Operations, 10ops-codfw, 06DC-Ops: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2279614 (10jcrespo) 05Open>03Resolved ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices : 14 Di... [07:51:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Very nice indeed, just remove the defined guard that is unnecessary." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [07:52:10] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:57:16] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2279644 (10hashar) I have left a note regarding updating /typos on the [[ https://wikitech.wikimedia.org/w/index.php?title=Infrastructure_na... [08:01:09] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2279647 (10jcrespo) a:05jcrespo>03RobH @Robh I would reuse for parts 5-7, decom 8-10 (with maybe salvage some drives), and repurpose 1-4 (which seem to be in a pristine state). I... [08:01:15] (03CR) 10Hashar: [C: 031] "Thanks! 
Cron entry looks legit as well :-)" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [08:01:17] 06Operations, 10ops-codfw, 06DC-Ops: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#2279652 (10jcrespo) [08:01:19] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2279654 (10jcrespo) [08:05:19] (03CR) 10Hashar: "Puppet compiler for iridium.eqiad.wmnet https://puppet-compiler.wmflabs.org/2716/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [08:07:13] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2279658 (10MoritzMuehlenhoff) p:05Triage>03High [08:14:42] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2274165 (10MoritzMuehlenhoff) Ok, so summarising: - This needs to be wired up in site.pp with at least a mininal stub role before we can actually assign permi... [08:22:04] 06Operations, 10Ops-Access-Requests: Grant root access for user madhuvishy for servers notebook1001 and 1002 - https://phabricator.wikimedia.org/T134716#2279676 (10yuvipanda) Thanks @MoritzMuehlenhoff. I'll add a stub role later today. [08:23:55] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:23:56] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:08] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2007-b [puppet] - 10https://gerrit.wikimedia.org/r/287904 [08:28:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2007-b [puppet] - 10https://gerrit.wikimedia.org/r/287904 (owner: 10Filippo Giunchedi) [08:29:38] !log bootstrap restbase2007-b T132976 [08:29:39] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [08:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:32:27] (03PS3) 10Filippo Giunchedi: graphite: export /var/lib/carbon via rsync [puppet] - 10https://gerrit.wikimedia.org/r/287608 [08:32:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: export /var/lib/carbon via rsync [puppet] - 10https://gerrit.wikimedia.org/r/287608 (owner: 10Filippo Giunchedi) [08:38:40] (03PS1) 10Filippo Giunchedi: graphite: fix hostnames / brown paperbag [puppet] - 10https://gerrit.wikimedia.org/r/287905 [08:38:56] hashar: ^ typos catched in post-merge, nice! [08:39:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: fix hostnames / brown paperbag [puppet] - 10https://gerrit.wikimedia.org/r/287905 (owner: 10Filippo Giunchedi) [08:41:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:32] godog: on post merge ??? [08:42:36] oh PS1 had a typo, got rebased/merged and the result arrived after [08:48:08] hashar: yeah and at the time I first submitted PS1 your change wasn't merged yet [08:48:30] <_joe_> what's up with mobileapps? 
[08:50:07] <_joe_> ok so right now it's not timing out [08:50:11] <_joe_> but mobrovac: [08:50:13] <_joe_> /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is WARNING: Test retrieve lead section of en.wp main page via mobile-sections-lead responds with unexpected body: /description => Main page of Wikimedia projects; /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is WARNING: Test retrieve en.w [08:50:19] <_joe_> p main page via mobile-sections responds with unexpected body: /lead/description => Main page of Wikimedia projects [08:51:21] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#1981274 (10hashar) From a mail I sent to the ops list, slightly extended: Icinga has a contrib plugin for IPMI checks: https://www.thomas-krenn.com/en/wiki/IPMI_S... [08:52:15] !log restarting tor on radium for openssl update [08:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:52:51] godog: we could revisit the Gerrit merge strategy for operations/puppet. It is able to cherry pick patches on tip of the branch, saves you from having to rebase. And maybe we can get Zuul to merge the change for you whenever tests are green [08:56:15] !log restarting exim on fermium/lists.wikimedia.org for openssl update [08:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:56:48] hashar: nah I think it's been just racy, the typos works as advertised it's just been racy and I didn't wait for post-rebase checks [09:01:13] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2279722 (10Joe) a:05GWicke>03Joe [09:03:29] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures [09:03:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:04:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:06:20] !log restarting etherpad-lite on etherpad1001 for openssl update [09:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:06:59] <_joe_> is anyone else looking at the 5xx surge? [09:07:20] PROBLEM - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: Connection refused [09:07:43] yeah I was taking a look, shows up in https://grafana.wikimedia.org/dashboard/db/varnish-http-errors-datacenters for the last 3h but not on https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json ? [09:07:51] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:07:59] -.- [09:08:41] (03Abandoned) 10Hashar: beta: disable spdy on Nginx tlsproxies [puppet] - 10https://gerrit.wikimedia.org/r/286821 (https://phabricator.wikimedia.org/T134362) (owner: 10Hashar) [09:09:01] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2279764 (10Gehel) This incident coincide in timing with a deployment of wmf.23 (according to @Dzahn, but the [[ https://gerrit.wikimedia.org/r/#/c/287759/ | change I fo... 
[09:09:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:09:26] 06Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 13Patch-For-Review, 07WorkType-Maintenance: beta cluster varnish cache can't apt-get upgrade nginx-full: nginx: [emerg] unknown "spdy" variable - https://phabricator.wikimedia.org/T134362#2279766 (10hashar) 05Open>03Resolved That was a transient... [09:09:53] (03PS1) 10Giuseppe Lavagetto: nagios_common: Add command for using service_checker [puppet] - 10https://gerrit.wikimedia.org/r/287907 (https://phabricator.wikimedia.org/T134551) [09:09:55] (03PS1) 10Giuseppe Lavagetto: monitoring: use service_checker for mobileapps LVS [puppet] - 10https://gerrit.wikimedia.org/r/287908 (https://phabricator.wikimedia.org/T134551) [09:10:20] <_joe_> mobrovac: ^^ [09:10:29] 06Operations, 03Discovery-Search-Sprint: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2279774 (10Gehel) [09:10:59] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Add command for using service_checker [puppet] - 10https://gerrit.wikimedia.org/r/287907 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [09:11:18] (03CR) 10jenkins-bot: [V: 04-1] monitoring: use service_checker for mobileapps LVS [puppet] - 10https://gerrit.wikimedia.org/r/287908 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [09:11:52] <_joe_> I hate the strict linters [09:12:11] <_joe_> they "fix" stupid things and let you happily write crappy code all along [09:12:28] <_joe_> but god forbid you forget a whitespace! [09:12:30] 06Operations, 03Discovery-Search-Sprint: Enable GC logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2279787 (10Gehel) [09:13:00] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.69 seconds [09:13:10] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.11 seconds [09:13:16] PROBLEM - MariaDB Slave Lag: s1 on db1052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.20 seconds [09:13:31] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 338.68 seconds [09:13:32] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.44 seconds [09:14:11] <_joe_> gehel: I think we turned them off because they were creating gigs of logs on the ES servers? [09:14:12] heh db1052 ? looking in tendril [09:14:19] * volans checking [09:14:27] <_joe_> volans: thanks :) [09:14:35] PROBLEM - MariaDB Slave Lag: s1 on db1051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 397.28 seconds [09:15:07] <_joe_> looks serious [09:15:10] _joe_: might well be the case. And you need to restart the JVM to rotate those logs... But Gigs of GC logs seems a bit high... 
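On the GC-log concerns above (gigs of logs, and needing a JVM restart to rotate them): HotSpot on Java 7/8 can rotate its own GC log, so re-enabling it does not have to mean unbounded growth or restart-to-rotate. A hedged example of the relevant flags — the log path and sizes are illustrative, and where they get injected for Elasticsearch depends on how this deployment sets its JVM options:

    -Xloggc:/var/log/elasticsearch/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=20M

With five 20M files the on-disk footprint is capped at roughly 100 MB per node.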
[09:15:10] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [09:15:12] it's already recovered actually [09:15:17] mmm [09:15:17] I'll do some digging [09:15:20] I guess was a heavy query on the master [09:15:20] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 0.52 seconds [09:15:23] I can check which one [09:15:34] RECOVERY - MariaDB Slave Lag: s1 on db1052 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [09:15:34] <_joe_> I saw db1051 too [09:15:50] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Replication lag: 6.67 seconds [09:15:50] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 7.39 seconds [09:16:19] CategoryMembershipChangeJob::run [09:17:10] RECOVERY - NTP on ganeti1003 is OK: NTP OK: Offset -0.009291648865 secs [09:17:21] actually [09:17:34] LinksUpdate::incrTableUpdate 127.0.0.1 [09:17:45] 42 seconds [09:17:45] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2279814 (10elukey) {F3990458} [09:18:57] jynus: again? [09:19:54] this is new [09:20:08] "new" [09:20:34] delete from LinksUpdate::incrTableUpdate on pagelinks [09:20:42] https://tendril.wikimedia.org/report/slow_queries?host=db1057&user=&schema=&qmode=eq&query=&hours=1 [09:20:56] yep [09:20:57] it was an issue mainly on s3 [09:21:10] but as "peaks of 10 seconds" [09:21:51] due to concurrency [09:22:09] the updates themselves didn't use to take so long [09:22:35] so, the issue was known, but has become worse [09:22:36] RECOVERY - MariaDB Slave Lag: s1 on db1051 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [09:23:31] basically, https://phabricator.wikimedia.org/T109943 [09:24:10] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:24:11] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:24:11] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:24:42] but this is the first time I've seen it on pagelinks and enwiki [09:25:40] the list of deleted items is pretty long, could be that? 
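The lag spike above was traced to a single long LinksUpdate::incrTableUpdate DELETE on pagelinks; the mitigation discussed just below is doing such deletes in bounded batches and letting replication catch up between chunks. A generic sketch of that pattern — not MediaWiki's actual code; the connection object and the wait_for_replication() helper are assumptions for illustration:

    # Generic batched-delete pattern: never delete more than BATCH_SIZE rows in one
    # transaction, and wait for slave lag to settle before the next chunk, so one
    # large links update cannot push replication lag past the alert threshold.
    BATCH_SIZE = 500        # same order of magnitude as the constant quoted below
    MAX_LAG_SECONDS = 5

    def delete_page_links(conn, page_id, wait_for_replication):
        """conn is a DB-API connection (e.g. pymysql); wait_for_replication blocks
        until all slaves report lag below the given budget (hypothetical helper)."""
        while True:
            cur = conn.cursor()
            cur.execute(
                "DELETE FROM pagelinks WHERE pl_from = %s LIMIT %s",
                (page_id, BATCH_SIZE),
            )
            conn.commit()
            if cur.rowcount < BATCH_SIZE:
                break                               # last (partial) batch done
            wait_for_replication(MAX_LAG_SECONDS)   # throttle before the next chunk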
[09:25:47] maybe before was batched in smaller batches [09:26:30] It could be https://gerrit.wikimedia.org/r/#/c/287007/ [09:27:29] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:08] const BATCH_SIZE = 500 [09:30:59] funny, because I think max lag was 300 seconds, just on the threshold of paging [09:31:11] yes [09:32:23] no, 400 on 51 [09:32:41] !log powercycling ms-be2016, unresponsive and serial console is dead [09:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:07] in any case, it is the observed lag, only sampled every 5 minutes [09:36:29] RECOVERY - swift-account-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:36:40] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 12.55, 3.38, 1.14 [09:36:40] RECOVERY - DPKG on ms-be2016 is OK: All packages OK [09:36:41] RECOVERY - swift-container-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:36:41] RECOVERY - swift-account-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:37:00] RECOVERY - dhclient process on ms-be2016 is OK: PROCS OK: 0 processes with command name dhclient [09:37:00] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [09:37:01] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:37:11] RECOVERY - swift-container-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:37:29] RECOVERY - swift-object-auditor on ms-be2016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:37:30] RECOVERY - swift-object-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:37:30] RECOVERY - swift-container-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:37:30] RECOVERY - Disk space on ms-be2016 is OK: DISK OK [09:37:40] RECOVERY - salt-minion processes on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:37:59] RECOVERY - swift-object-server on ms-be2016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:38:09] RECOVERY - swift-account-reaper on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:38:10] RECOVERY - RAID on ms-be2016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:38:11] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [09:38:11] RECOVERY - Check size of conntrack table on ms-be2016 is OK: OK: nf_conntrack is 10 % full [09:38:11] RECOVERY - swift-container-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:38:11] RECOVERY - swift-object-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:39:30] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:40:49] RECOVERY - NTP on ms-be2016 is OK: NTP OK: Offset 0.0004633665085 secs [09:41:51] 
(03PS2) 10Giuseppe Lavagetto: nagios_common: Add command for using service_checker [puppet] - 10https://gerrit.wikimedia.org/r/287907 (https://phabricator.wikimedia.org/T134551) [09:41:54] (03PS2) 10Giuseppe Lavagetto: monitoring: use service_checker for mobileapps LVS [puppet] - 10https://gerrit.wikimedia.org/r/287908 (https://phabricator.wikimedia.org/T134551) [09:44:31] !log rolling restart of swift backend servers in codfw and eqiad [09:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:04] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2279857 (10dcausse) GC issue seems to perfectly coincide with a merge on elastic1026. We could also start to monitor all segment related stats, most of them are bytes s... [09:48:41] RECOVERY - MariaDB Slave SQL: s2 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:51:10] RECOVERY - MariaDB Slave SQL: s3 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:51:15] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2279861 (10Nikki) I can add jamwiki links using QuickStatements (e.g. https://www.wikidata.org/w/index.php?diff=334730538) but if I try and do i... [09:51:49] RECOVERY - MariaDB Slave SQL: s6 on dbstore2002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:52:22] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: Connection refused Filippo Giunchedi bootstrap [09:59:42] (03PS1) 10Faidon Liambotis: Bump slave lag check's retries to 10 [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/287911 [09:59:49] jynus, volans ^^^ [10:00:25] tell me what you think of the idea and the number specifically, 10 is just a rough proposal [10:00:47] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2279888 (10elukey) Number of pages vs chunk size for mc1008 and mc1004: {F3990567} {F3990569} [10:04:27] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2279894 (10elukey) So as far as I can see there is a high demand (i.e. more pages allocated) of chunk sizes smaller than 20K. The growth factor of 1.10 could be... [10:05:02] (03CR) 10Jcrespo: [C: 04-1] "> not worthy to page everyone (especially since people are" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/287911 (owner: 10Faidon Liambotis) [10:06:08] I won't object if you want to be paged for something extra [10:06:27] but in general, we try to keep pages to be minimal and indicative of serious user-facing problems [10:07:49] most of the times this check has alerted the response was "just a small peak, nothing to see here, move along" [10:11:59] paravoid, actually [10:12:08] the issue is that those should ping devels [10:12:14] so it is very actionable [10:12:19] just not ops-actionable [10:12:21] (03CR) 10Alexandros Kosiaris: [C: 031] "When should we merge this ?" [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [10:12:26] devels? 
[10:12:36] mediawiki issues [10:12:57] no, that's not how we use paging [10:13:00] as in, it is not an infrastructure issue [10:13:05] I'm sure this all caused by some underlying code issue [10:13:10] yes [10:13:20] actuall, I pointed the exact place [10:13:22] but paging is not for annoying people to fix long-standing bugs [10:13:35] then let's do what I wanted to do [10:13:42] which is set all lag alerts to warnings [10:13:55] that's fine too :) [10:14:44] or non-paging problems, whatever you prefer [10:14:45] my ultimate goal is that if I'm at home and not at my computer and I get a page, I should be reaching out to my computer and joining IRC to investigate [10:15:03] I have to do that [10:15:22] all of us should :) [10:15:28] most of the devel issues are because I prompt them to do something [10:15:34] after receiving a page [10:16:00] I think what is missing are email-only alarms, probably more worth for specific groups to be alerted is something is going on the wrong way, but not so critical to page [10:16:03] the question is I cannot differenciate from code issues [10:16:11] yeah, volans' is a good point [10:16:15] from infra issues [10:16:35] I agree [10:16:38] paging is for really serious issues and we have 24/7 & weekend coverage of it [10:16:50] but as I said, I had no time to implement it [10:17:11] so if these things you're responding to are really urgent/important, then others should be able to do them (or call you after initial investigation) even during weekends etc. [10:17:18] I don't think the slave lags fall into that category, right? [10:17:26] actually, I do that ecery time [10:17:39] I've spent the past 2 weekends preciselly doing that [10:18:16] because if not, things get worse [10:18:54] paravoid: it's more tricky, lag of all the slaves of a shard or only one? how much lag and for how much time? not so easy to configure on a simple icinga check ;) [10:19:08] paravoid, I said that the right fix [10:19:16] is to page on service failures [10:19:21] not on server failures [10:19:33] indeed, that has been our policy traditionally [10:19:36] e.g. can mediawiki contact a database (any)? [10:19:38] in an ideal world each delay > 10s should be looked at (not forcely as an emergency) to find the reason and try to avoid it [10:20:10] paravoid, not true, I asked mark or you and our whole page system does not consider that [10:20:22] e.g. "is mediawiki up" is not a page [10:20:36] only individual services/servers [10:20:44] (which do not page) [10:20:56] we page on appservers.svc.eqiad/wmnet.wmnet HTTP failures, not individual mw* HTTP/HHVM failures [10:21:11] that is the point of entry [10:21:27] similarly, we page on text-lb.svc.eqiad/codfw.wmnet HTTP/HTTPS failures, not individual cp* varnish/nginx/host down etc. 
failures [10:21:31] not the whole layer [10:21:34] paravoid: yes but there you have an homogeneous cluster that do one thing [10:22:02] basically I want mediawiki.eqiad and db.eqiad [10:22:09] (or eqivalent) [10:22:16] a single DB shard has a master, several slaves of which some have special roles from the application point of view, so it should be the application then to alarm when their checks fails [10:22:21] I'm saying that the policy has traditionally been to page on service failures (meaning: "esams text caches"), not individual hosts, as jynus proposed [10:22:22] I hope you understand what I mean [10:22:45] yes, I am saying we are not covering that on certain layers, and needs to get better [10:22:45] for databases it's clearly much more complicated [10:22:55] agreed [10:23:13] pages I want: [10:23:21] we agree on the goal, I think [10:23:21] "can we write to enwiki?" [10:23:30] "can I read from enwiki?" [10:23:48] unless you have spare cycles to work on this -which I doubt- these won't happen for a while [10:24:31] which is why I think the imperfect alert system we have now is ok, if you just disable the pages [10:24:45] so I can check them on my own app [10:24:54] until then, we should do our best to decrease false positives (and, by extension, non-urgent immediate actionable actionables) to allow serious alerts to not be ignored by the rest of opsens [10:24:59] and also, to not burn you out completely [10:25:57] if slave lags on a weekend briefly, for one minute, due to a long-standing code bug, you can triage it and raise it with code authors on monday, hopefully :) [10:26:05] paravoid, I have been more frustrated by people calling me after a non-page and being a non-issue than otherwise :-) [10:26:11] and enjoy your weekend, without your phone ruining your off day? :) [10:26:26] the main issue right now is labs and analytics [10:26:37] and we do not even page on those [10:27:08] core is only bad in terms of automatization [10:27:20] first things first :) [10:27:46] if you want to help, I would ask you: either disable paging for replication [10:27:54] but keep the critical [10:28:30] or page only for a specific group [10:28:52] ok [10:29:14] I'd prefer the former, if you do too [10:29:15] but 10 minutes or 10 checks is not a good thing- it is like setting the threshold on 300 seconds or 600 [10:29:38] a skip can always go up [10:29:43] (03PS1) 10Elukey: Add the possibility to specify memcached's chunk growth factor. 
[puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [10:29:44] *spike [10:29:56] well the idea was that if we have persistent slave lag for > X minutes, it's a much more serious issue (that warrants paging) than slave lag for 10 seconds, then gone again [10:30:16] not, it shoul be the other way round [10:30:24] lower the time, but do not page [10:30:40] in the sense that spikes might be of interest to you, for raising code bugs, but persistent slave lag is of interest to everyone as the sites are in serious trouble [10:30:40] detect before, just do not immediately assume the worse [10:30:48] we could do both [10:30:54] (03PS1) 10Ema: mdadm boot-time race condition: sleep in init-premount [puppet] - 10https://gerrit.wikimedia.org/r/287914 (https://phabricator.wikimedia.org/T131961) [10:31:18] non-paging alert for the current checks, paging alert for persistent lag [10:31:36] but I'll start with your idea, that's fine :) [10:32:01] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252558 (10Lydia_Pintscher) I can confirm. It still doesn't work in the UI for me either. [10:32:02] no, what you call "presistent lag" [10:32:08] should be a mediawiki check [10:32:24] 90% of the servers will be depooled immediately on lag [10:32:46] I am well aware [10:32:59] if it is so bad that causes issues, mediawiki will notice, and only then should page [10:33:23] until we have that, I am ok with no pagging [10:33:36] in fact, it was in my todo, but not a priority [10:34:24] I wonder what would be a good place to add checks to mediawiki, terbium? [10:36:33] depends on what you mean checks to mediawiki [10:36:52] if you need high-level HTTP endpoint checks, running them on the icinga server (no NRPE) is best) [10:36:53] well, ideally it would be a distributed check [10:37:18] (03CR) 10KartikMistry: "Thanks. We can go ahead with this sometime later today? 16:00–17:00 UTC Puppet SWAT is good for me." [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [10:37:30] if you want to run mediawiki code to check, e.g. poll memcached for mediawiki's view of slave lag, then terbium probably, yeah [10:37:35] but may need some source donde dependency [10:38:01] that is why having a couple of checks on terbium and its equivalent on codfw [10:38:16] may be (still a hack), but easier [10:38:39] and we could check a couple of things common to that layer [10:39:00] in the end, it all comes back to having load balancing inside the app [10:39:29] if we had, like upper layers, the load balancing independent, we could check that [10:39:43] (03PS1) 10Faidon Liambotis: mariadb: set is_critical to false for checks [puppet] - 10https://gerrit.wikimedia.org/r/287916 [10:39:51] I don't disagree with any of that [10:40:06] we're just all pressured on time, and perfect is the enemy of good :) [10:41:10] no, actually that should be prioritized [10:41:21] because it solves not only monitoring [10:41:30] but also pooling issues [10:41:33] (03PS2) 10Elukey: Add the possibility to specify memcached's chunk growth factor. 
[puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [10:41:54] we are doing that on next quarter [10:42:18] and hopefuly introduce proxying (or equivalent) after that [10:42:48] and finally, introduce syncronous replication to avoid masters SPOF [10:43:05] are you ok with that patch above? [10:43:09] yes [10:43:49] (03PS1) 10Filippo Giunchedi: scap: update to 3.2.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/287918 [10:44:03] cool, thanks [10:44:10] (03CR) 10Faidon Liambotis: [C: 032] mariadb: set is_critical to false for checks [puppet] - 10https://gerrit.wikimedia.org/r/287916 (owner: 10Faidon Liambotis) [10:44:35] (03Abandoned) 10Faidon Liambotis: Bump slave lag check's retries to 10 [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/287911 (owner: 10Faidon Liambotis) [10:44:54] wait [10:45:04] what? [10:45:06] those will not be show on IRC now [10:45:15] why not? [10:45:22] oh is dba not showing here? [10:45:25] ok, I'll fix [10:45:26] dba does not send things to irc [10:45:40] I think admins does [10:46:36] the puppet is missleading, because is_critical means paging all [10:47:31] (03PS1) 10Faidon Liambotis: mariadb: set replication check's contact_group to admins [puppet] - 10https://gerrit.wikimedia.org/r/287919 [10:48:24] that ok? [10:48:28] yes [10:48:32] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mariadb: set replication check's contact_group to admins [puppet] - 10https://gerrit.wikimedia.org/r/287919 (owner: 10Faidon Liambotis) [10:49:06] cool [10:49:25] akosiaris, mobrovac: what's with scb* MCS warnings outstanding for the past 1h20? [10:53:43] (03PS3) 10Faidon Liambotis: added SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [10:53:58] (03PS4) 10Faidon Liambotis: Add SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [10:54:15] (03CR) 10Faidon Liambotis: [C: 032] Add SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [10:55:39] paravoid: I have no idea... looking [10:55:49] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/287914 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [10:57:05] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2280037 (10jcrespo) 05stalled>03Open a:03jcrespo Several days and no new power loss. I will reimage it and set its master with transactional replication. [10:57:30] 06Operations, 10ops-codfw, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2280040 (10jcrespo) [11:00:16] (03PS3) 10Elukey: Add the possibility to specify memcached's chunk growth factor. 
[puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [11:01:52] !log rolling restart of swift frontend servers in codfw and eqiad for openssl update [11:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:08] !log jmm@palladium conftool action : set/pooled=no; selector: ms-fe2001.codfw.wmnet [11:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:48] !log jmm@palladium conftool action : set/pooled=yes; selector: ms-fe2001.codfw.wmnet [11:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:21] <_joe_> moritzm: you might want to use --quiet [11:04:35] <_joe_> since you already logged that [11:04:44] <_joe_> btw I will invert the logic this week [11:04:50] <_joe_> add --log instead of --quiet [11:06:42] heh, I was just thinking "it would be awesome if confctl had a --quiet option" :-) [11:23:34] paravoid: i'm back and aware of them [11:25:00] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:17] !log stopping es2017 and es2019 for cloning 17 -> 19 + regular conf/upgrades [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:29:00] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: puppet fail [11:30:38] (03PS2) 10Mobrovac: service::node: Add a convenience script to pretty-tail logs [puppet] - 10https://gerrit.wikimedia.org/r/287902 [11:36:55] !log apache restart on bohrium/piwik for openssl update [11:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:36] (03CR) 10Mobrovac: service::node: Add a convenience script to pretty-tail logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [11:40:31] (03CR) 10Mobrovac: "MobileApps has already been scheduled for PuppetSWAT for the same purpose, I'd prefer to do it either before (aka ASAP) or tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [11:41:26] (03CR) 10Mobrovac: [C: 031] "I think this can go out now." [puppet] - 10https://gerrit.wikimedia.org/r/287918 (owner: 10Filippo Giunchedi) [11:41:30] mobrovac: let's schedule tomorrow. I'm deploying cxserver today for MT changes. [11:41:41] mobrovac: let me reply in patch. [11:41:47] kk kart_ [11:42:27] (03CR) 10KartikMistry: "OK. Let's deploy tomorrow. I'm updating cxserver for other registry changes (MT, languages..) today." [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [11:44:30] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, and 2 others: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2280120 (10faidon) [11:44:32] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2280118 (10faidon) 05Resolved>03Open The disk may have been replaced, but it wasn't partitioned/re-added to RAID, so the original request (RAID failure) is still not resolved. @Dzahn, will you handle? [11:47:52] (03CR) 10Elukey: "The puppet compiler results seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [11:47:55] PROBLEM - dhclient process on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
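For the memcached change being reviewed here (and elukey's earlier T129963 comments about most demand being for chunks under 20 KB): memcached's -f option is the multiplier between successive slab chunk sizes, 1.25 by default, and lowering it to something like 1.10 gives a finer-grained ladder in the small-chunk range at the cost of more slab classes. A rough illustration — the 96-byte starting chunk and 1 MiB page size are simplifying assumptions; the real ladder also depends on -n and per-item overhead:

    def slab_ladder(factor, base=96, page=1024 * 1024):
        """Approximate memcached slab chunk sizes for a given growth factor."""
        sizes, size = [], float(base)
        while size < page:
            sizes.append(int(size))
            size *= factor
        return sizes

    for factor in (1.25, 1.10):
        ladder = slab_ladder(factor)
        small = [s for s in ladder if s <= 20 * 1024]
        print("factor %.2f: %d slab classes, %d of them at or below 20 KB"
              % (factor, len(ladder), len(small)))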
[11:47:56] PROBLEM - RAID on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:25] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:34] PROBLEM - puppet last run on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:46] PROBLEM - salt-minion processes on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:54] PROBLEM - Disk space on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:55] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:56] PROBLEM - Check size of conntrack table on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:59] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2280141 (10Nikki) I've also noticed that there are a lot of pages not appearing in the categories they're supposed to be in (compare [[https://j... [11:49:04] PROBLEM - configured eth on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:16] PROBLEM - DPKG on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:23] (03PS1) 10Jcrespo: [WIP] Removing references to pnds database on s5 [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) [11:50:24] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [11:50:54] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:52:45] 06Operations, 06Services, 03Mobile-Content-Service: Automatic monitoring checks for the MCS failing in production - https://phabricator.wikimedia.org/T134866#2280175 (10mobrovac) [11:52:55] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: puppet fail [11:52:57] 06Operations, 06Services, 03Mobile-Content-Service: Automatic monitoring checks for the MCS failing in production - https://phabricator.wikimedia.org/T134866#2280191 (10mobrovac) p:05Triage>03High [11:53:37] 06Operations, 06Services, 03Mobile-Content-Service, 15User-mobrovac: Automatic monitoring checks for the MCS failing in production - https://phabricator.wikimedia.org/T134866#2280175 (10mobrovac) [11:54:09] (03CR) 10Mobrovac: "Still looking good for PCC - https://puppet-compiler.wmflabs.org/2725/" [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [11:54:20] _joe_: ^ [11:55:05] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:55:35] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 694 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6185993 keys - replication_delay is 694 [11:56:04] kraz is likely affected by that qemu bug, looking into it [11:58:46] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:00:36] RECOVERY - salt-minion processes on kraz is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:00:45] RECOVERY - Disk space on kraz is OK: DISK OK [12:00:54] RECOVERY - Check size of conntrack table on kraz is OK: OK: nf_conntrack is 0 % full [12:00:55] RECOVERY - configured eth on kraz is OK: OK - interfaces up [12:01:05] RECOVERY - DPKG on kraz is OK: All packages OK [12:01:45] RECOVERY - dhclient process on kraz is OK: PROCS OK: 0 processes with command name dhclient [12:01:55] RECOVERY - RAID on kraz is OK: OK: no RAID installed [12:02:25] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [12:03:57] !log "powercycled" kraz (stuck by qemu bug) [12:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:04:15] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:05:05] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:05:45] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:06:44] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:07:32] !log rolling restart of maps cluster for openssl update [12:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:05] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:11:36] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:12:14] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:12:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:12:40] (03PS1) 10Hashar: typos file using extended regular expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) [12:15:08] (03PS2) 10Hashar: typos file using extended regular expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) [12:17:08] (03CR) 10Gilles: [C: 031] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [12:19:15] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:19:24] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2280299 (10dcausse) Another small fact : when GC issue started most of the queries reported as WARN (> 1s) in /var/log/elasticsearch/production-search-eqiad_index_searc... [12:24:28] 06Operations, 10Traffic: Support webockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280319 (10BBlack) [12:24:49] 06Operations, 10Traffic: Support webockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280333 (10BBlack) [12:24:51] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2280332 (10BBlack) [12:25:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:25:18] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:27:00] Hi, I can't join to any channel @irc.wikimedia.org:6667 . IRC have no channels. [12:27:28] 06Operations, 10Traffic: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280338 (10BBlack) [12:28:10] 06Operations, 10Traffic: Support webockets in cache_misc - https://phabricator.wikimedia.org/T134870#2280353 (10BBlack) [12:28:12] 06Operations, 10Traffic: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280352 (10BBlack) [12:28:33] !log mobileapps deploying b8c396ae [12:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:28:55] rxy: I'm taking a look [12:30:15] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6149190 keys - replication_delay is 0 [12:32:14] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [12:32:35] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [12:33:14] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:33:29] rxy: should be good [12:33:34] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:33:39] !log restart ircecho on kraz [12:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:15] godog: thank you :) [12:34:40] rxy: np, thank you for reporting it! [12:38:06] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:39:26] godog: can you please file a Phab task? this should really auto-start upon reboots [12:39:58] (03PS1) 10Addshore: DNM Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) [12:40:11] moritzm: yeah I'm looking at what was the root cause, "ircecho code" being likely [12:40:14] (03CR) 10Hashar: [C: 032] typos file using extended regular expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:40:23] (03CR) 10jenkins-bot: [V: 04-1] DNM Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [12:40:41] it did start, but throws its toys out of the pram if it can't connect to the irc server [12:40:57] (03Merged) 10jenkins-bot: typos file using extended regular expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287926 (https://phabricator.wikimedia.org/T133047) (owner: 10Hashar) [12:41:13] godog: Additionally, Why I can't join to #central @irc.wikimedia.org? That happened since yesterday. [12:41:28] moritzm: thanks, all me buddy^ [12:41:32] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 13Patch-For-Review: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2280412 (10hashar) 05Open>03Resolved a:03hashar I have: * added a `/typos` file to operations/mediawiki-config.... 
[12:41:57] chasemp: k, I'll ping in 20 mins for a brief handover [12:42:09] k [12:43:44] rxy: looks like channels are created on demand as messages flow in, so it might be that no changes from (I'm assuming) centralauth [12:44:48] (03PS1) 10Muehlenhoff: Assign salt grains for oresrdb and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/287937 [12:46:37] !log hashar@tin Synchronized typos: (no message) (duration: 01m 55s) [12:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:37] entirely harmless deployments ^^^ [12:47:38] !log hashar@tin Synchronized multiversion/MWMultiVersion.php: Typo fix in a couple comment blocks (duration: 00m 28s) [12:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:48:57] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2280451 (10fgiunchedi) [12:48:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for oresrdb and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/287937 (owner: 10Muehlenhoff) [12:49:03] moritzm: ^ [12:49:54] thx [12:50:29] godog: It is strange. #central is feeding every "SUL creation" (creating new account and auto creating account at All WMF public wikis) logs. [12:52:02] rxy: I see, I'm not sure how to further debug that though, mind filing a phabricator task? [12:54:22] gehel: so, maps! [12:54:46] let's start by the basics. It's an HTTP service so our routing layer is varnishes [12:55:00] there is a dedicated 4 box varnish cluster dedicated to maps [12:55:24] it used to be eqiad only but now it's codfw as well. It's using varnish 4 these days [12:55:35] 16 varnishes now, if we count eqiad, codfw, ... [12:55:47] yup [12:56:22] so when a user requests a map tile (as instructed by say lealet.js on maps.wikimedia.org), it's a standard HTTP request that goes the stack as always [12:56:44] the backend varnish layer routes to the maps-test200X boxes which run a service called kartotherian [12:56:49] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2280451 (10MoritzMuehlenhoff) A "Requires=ircd.service" in the ircecho unit would probably fix this. [12:56:59] https://git.wikimedia.org/summary/maps%2Fkartotherian [12:57:14] rxy: Is there a phabricator about this IRC bug? [12:57:24] rxy: Can you try connecting to the (old) argon instead? [12:57:29] the kartotherian service is the one responsible for serving the tile to the user [12:57:50] what is does most of the times is fetch the tile from cassandra. [12:58:07] <_joe_> so it's yet another cassandra frontend? [12:58:20] <_joe_> we're getting good at duplicating things! [12:58:27] _joe_: :P [12:58:45] duplication isn't part of our official goal? [12:59:15] it does that by knowing the zoom level of the tile (since it is in the request anyway) and looks it up in a cassandra keyspace [12:59:24] <_joe_> gehel: our official goal is not using anything not invented here [12:59:34] <_joe_> (ok I'll stop with the sarcasm right now, sorry) [12:59:37] _joe_: yes, I saw the RFC on that one... [12:59:47] if it does not find it, it will go to a lower zoom level [13:00:22] like say the request is for zoom level 16, if it is not found in cassandra, kartotherian will request zoom level 15, 14 and so on until it finds one [13:00:27] <_joe_> akosiaris: oh, so there is no tile rendering on demand? 
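
A minimal sketch of the zoom-level fallback akosiaris describes just above. Kartotherian itself is a Node.js service, so this Python pseudocode and the cassandra_get() helper are purely illustrative, not the real code:

    # Illustrative only: walk down the zoom levels until a stored vector
    # tile is found; the caller then re-renders the requested zoom from it.
    def find_vector_tile(zoom, x, y, cassandra_get):
        z, tx, ty = zoom, x, y
        while z >= 0:
            tile = cassandra_get(z, tx, ty)     # hypothetical storage lookup
            if tile is not None:
                return z, tile
            # the parent tile one zoom level down covers the same area
            z, tx, ty = z - 1, tx // 2, ty // 2
        return None, None
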
[13:00:31] Krinkle: I did searching about this IRC bug in phab, but I can't find that related task. What is "(old) argon"? [13:00:32] so we have prerenredered tiles in Cassandra , Kartotherian as a NodeJs API/web service frontend and Varnish merely grab from that svc ? [13:00:46] 06Operations: cronspam from argon - apache2 logrotate - https://phabricator.wikimedia.org/T132896#2213456 (10Krinkle) Is this still an issue now that irc.wikimedia.org no longer has an Apache serving an HTTP(S) redirect? What about argon's replacement in Ganeti? [13:00:53] _joe_: there is. kind of [13:01:09] (03PS1) 10Bartosz Dziewoński: Simplify $wgApiFrameOptions configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287939 [13:01:16] rxy: nevermind about argon, the migration is complete and the old irc server was shutdown. [13:01:19] so it is easy to render a tile of say zoom level 16 from zoom level 14 or 15 or 13 or something close [13:01:45] that is because we are using vector tiles [13:01:46] rxy: Recently we migrated the server behind irc.wikimedia.org from argon (2011-2016) to kraz (2016). These are nicknames for internal Wikimedia servers. [13:01:47] akosiaris: so karthoterian will render the zoom level request by the client from lower level tiles [13:01:56] gehel: exactly [13:02:01] This also included an upgrade of the Linux OS [13:02:10] which works quite well for a lot of tiles [13:02:15] akosiaris: I did some diagram of what I understood. Careful, long URL coming: [13:02:16] http://plantuml.com/plantuml/png/ZLOzRzim4DtzAmvUaW1L4RlD8A38qide1uAsoP9qe2JA4YAJ1ad7BeByzzvHb4LI3EnDUlVqxiZ7E-ho2kO_s5O_YSljhkZQyceEJCC6eRlPRClAPxRcKCggtiFhR0OmGFOhA5dPOBZRQSEL_R97Nf4986I5mUto-lgJGm0UNvWIhMwoMabXkAZ-qbSlDITMp2amsP5IZ9ItI3u_Ipu_BP-dv1StvyWlRaxAEqealroS8xzSd9Ht599_8wikHszRL5E2TQExAsBO6kRqcXck3Mx00iaRHcwEL89TYO_Vc3BID5orJDZCFa34dlQdxRXYHYhBERu [13:02:16] b2FUha7A_Ef8gPbZ1D8UlJ6icAiM8UNoPydxngjN4iG_J-FnEQhDM77UaNwDTuW1QllqWg1UoZjSckur48mx60vq4vJOgFUJqwxP2Oo9BYNVOSk9TU4kwkSm-ev-wOozS1tukp_JpS3ZXvuIGyjEHJlePys8_uxZdPx_6o4_anqOS3h6VJZJY17w-ImbhdnfAD3zHmi-aacjWswUGfNei7UfJk4b9gVjE7w1y0k5eYw7bOeG8gvqOWkC8TWFLG_C19Sa1jGFFZQyHnJarku2Tm6jbi-720DOFHIVBPoXu9_TW93HefyKtw4BtZwZRZXiE3_tqGq08NffM2erYoJAJS9w4iT0KZ9LUZ84yhPw [13:02:16] drVEqUlwEkhzjdde0vqdOK_VXGDtVVayzWFCaxAd77lQdy0v1x_wEe7m2sJCHGhsOts_m9OldQT0X69_Em6EVsMWmHON4ewFQTMD0ri7VXhqPlSE4Hz3iEHSsE1osuJe6J0VkthK4fkY4AkxEv-43PGBTo7qhrZsZw_0SunZ37kKV4I8PAcTQrj2603yvhwWBy5SAcwAjUmMQXJRy_U4toMmb5WzI4M6erB3kMVTJibMu96mTrA6_WDhFZUF5viPnAN3hCkn6lSDbi4Bk3EmRPcnpxaHRCjeMKgGfmwa8jqDml6dipQDdW_f_ [13:02:19] rxy: So the issue is that the #central channel doesn't exist or can't be joined? [13:02:29] gehel: not sure what to do with that [13:02:31] :P [13:02:32] btw [13:02:37] https://www.mediawiki.org/wiki/Maps/Tile_server_implementation [13:02:48] it probably is a bit outdated, but only for minor stuff [13:03:02] gehel: URL ends up being broken in parts :/ [13:03:13] (03PS2) 10Ema: mdadm boot-time race condition: sleep in init-premount [puppet] - 10https://gerrit.wikimedia.org/r/287914 (https://phabricator.wikimedia.org/T131961) [13:03:23] longer than I thought, http://tinyurl.com/z4ek263 [13:03:23] (03CR) 10Ema: [C: 032 V: 032] mdadm boot-time race condition: sleep in init-premount [puppet] - 10https://gerrit.wikimedia.org/r/287914 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [13:03:41] (03CR) 10Yuvipanda: "Minor quibble about comments but otherwise looks great!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [13:04:13] so, PNG generation happens after you got that vector tile in the correct zoom level, regardless whether kartotherian got it directly from cassandra or it did generate it from a lower level zoom vector tile [13:04:23] (again from cassandra) [13:04:32] gehel: looks neat [13:04:35] why cassandra ? there is no good answer on that one [13:04:44] akosiaris: I did not find this implementation page yet. Thanks! [13:04:57] IIRC, it was "it's easy to do it right now" [13:05:12] and that's the request flow part [13:06:36] the idea is to also have vector tiles served directly at some point instead of just png [13:06:45] but that's in the future IIRC [13:06:54] now the vector tile generation part [13:07:14] that happens on the demand of an "administrator" (that would always be yuri up to now) [13:07:21] One more question: we do not store higher zoom levels in Cassandra? [13:07:39] (03PS1) 10BBlack: tlsproxy: minimize keepalives diff in config [puppet] - 10https://gerrit.wikimedia.org/r/287940 (https://phabricator.wikimedia.org/T134870) [13:07:41] (03PS1) 10BBlack: Pipe websockets through traffic layers [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) [13:07:43] depends on the tile [13:07:46] and the zoom level [13:07:56] for example there is a lot of deduplication for water tiles [13:08:00] Depending on what? Or is that the tiles generation part you were starting? [13:08:22] there is no point in storing zoom level 18 for a lake for example [13:08:39] since you can generate very fast all of them from say level 16 [13:08:49] gehel: this chart is always helpful when thinking about tiles, too: http://wiki.openstreetmap.org/wiki/Tile_disk_usage [13:08:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6152806 keys - replication_delay is 642 [13:09:05] gives you a feel for what zoom levels have excessive tile counts, where it's not sane to pre-render them all, etc [13:09:44] gehel: btw, always remember the disctinction between vector tiles and png tiles [13:09:46] But on what do we take this decision? Previous traffic to this tile? [13:09:59] Krinkle: Thank you for telling details of argon. I think currently issue is both. but yesterday, Although SULWatcher is working, I couldn't join to a #central. [13:10:05] The column "Maximum (4^zoom)" is the theoretical number of tiles in a zoom level. and then there's that %viewed, showing that most are never seen [13:10:07] (03PS1) 10Krinkle: Restore missing CentralAuth messages to irc-recentchanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287943 (https://phabricator.wikimedia.org/T134877) [13:10:16] rxy: ^ [13:10:17] rxy: https://phabricator.wikimedia.org/T123729 [13:10:32] gehel: oh no, the tile itself. 
but lemme get to that [13:10:38] so vector tile generation [13:10:46] (03PS2) 10Krinkle: Restore missing CentralAuth messages to irc-recentchanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287943 (https://phabricator.wikimedia.org/T134877) [13:10:48] it is done by the tilerator service [13:10:58] which is not really a service [13:11:20] it's more like a worker [13:11:25] PROBLEM - NTP on kafka1001 is CRITICAL: NTP CRITICAL: No response from NTP server [13:11:26] well, many workers [13:11:33] Krenair: mutante: sanity check on https://gerrit.wikimedia.org/r/#/c/287943/2/wmf-config/ProductionServices.php [13:11:48] so, there's a redis database that functions as a queue on maps-test2001 (or maps::master) [13:12:13] the "administrator" (via another service I 'll discuss later on) submits jobs to that "queue" [13:12:16] Krinkle: Many thanks :) [13:12:29] (03CR) 10Krinkle: [C: 032] "Pushing to stage (mw1099)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287943 (https://phabricator.wikimedia.org/T134877) (owner: 10Krinkle) [13:12:53] and the tilerator service, which runs on all 4 hosts picks them up and starts the vector tile generation process, eventually storing them in cassandra [13:13:09] (03Merged) 10jenkins-bot: Restore missing CentralAuth messages to irc-recentchanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287943 (https://phabricator.wikimedia.org/T134877) (owner: 10Krinkle) [13:13:24] so, in reality, and unlike the kartotherian service, tilerator never accepts requests [13:13:32] there is one exception, which is monitoring [13:13:47] so we only use tilerator's HTTP endpoint to monitor it from icinga, nothing else [13:13:57] looking into kafka1001 [13:14:19] same as kartotherian, tilerator runs on all 4 nodes and is quite CPU intensive when jobs exist [13:14:30] which is not very often [13:14:46] 06Operations, 13Patch-For-Review: decom argon - https://phabricator.wikimedia.org/T134223#2258481 (10Krinkle) MediaWiki production is still sending UDP packets to argon's old IP on every event. Fixed in 7be1ed68374a2cb4204fddc884fd801f95f14339. [13:14:50] how are the jobs submitted ? there is one more service called tileratorui [13:15:02] affected by T126733, fixed [13:15:02] T126733: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733 [13:15:17] it shares the exact same code with tilerator with one small (large?) difference in the configuration [13:15:46] it's not working in a "generate" configuration, but in a "post jobs" configuration [13:15:57] (03PS1) 10Bartosz Dziewoński: Undeploy UploadWizard from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 [13:16:14] it does also expose an HTTP interface, albeit it is only used over a tunneled SSH connection by the "administrator" (yuri) [13:16:37] yep [13:16:59] (03PS1) 10Jdrewniak: T132520 bumping portals to master. Disabling A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287945 (https://phabricator.wikimedia.org/T132520) [13:17:00] yurik: hello, I was giving gehel a brief rundown of the maps infra [13:17:09] (03PS1) 10Hashar: contint: create /mnt/redis [puppet] - 10https://gerrit.wikimedia.org/r/287946 [13:17:15] akosiaris, thx :) [13:17:24] feel free to correct me whenever I am wrong [13:17:25] tileratorui runs only on the master node (maps-test2001) I expect... 
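
A rough sketch of the job flow being described here: tileratorui pushes tile-generation jobs onto the Redis queue on the maps master, and the tilerator workers consume them. The queue name, job format and the render/save helpers below are invented for illustration; the real services are Node.js and use their own job machinery:

    import json
    import redis  # assumes the redis-py client is available

    r = redis.StrictRedis(host="maps-test2001.codfw.wmnet")  # maps::master

    # "Administrator" side (what tileratorui does conceptually):
    # enqueue a range of tile indices to (re)generate at one zoom level.
    r.rpush("tilerator:jobs", json.dumps({"zoom": 10, "start": 0, "end": 4 ** 10}))

    # Worker side (what each tilerator process does conceptually):
    while True:
        _, raw = r.blpop("tilerator:jobs")
        job = json.loads(raw)
        for idx in range(job["start"], job["end"]):
            tile = render_from_postgres(job["zoom"], idx)   # placeholder
            if tile is not None:
                save_to_cassandra(job["zoom"], idx, tile)   # placeholder
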
[13:17:30] * yurik reads back [13:17:34] !log krinkle@tin Synchronized wmf-config/ProductionServices.php: (no message) (duration: 00m 28s) [13:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:44] gehel: actually no, it runs everywhere, but yuri is careful enough to only use one instance at a time [13:17:55] and it is not multiprocess either IIRC [13:18:02] cause it doesn't need to be [13:18:04] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 28s) [13:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:13] but what you say makes sense ofc [13:18:42] tileratorui submits these jobs into redis from which tilerator picks them up [13:18:53] so where does tilerator get the data ? postgres! [13:19:00] get the data from* [13:19:24] RECOVERY - NTP on kafka1001 is OK: NTP OK: Offset -0.001733422279 secs [13:19:25] there is a postgres install on every box, but only one is the master (maps-test2001 or maps::master) [13:19:44] gehel, sure, we could only run it on master because only master has the redis (for queue management), and tilerator is dead without the queue, so might as well combine. If we set up distributed redis, then running tilerator ui from any of the servers is also ok (just in case master is down) [13:20:00] it has postgis installed, and cronjobs that run osmosis [13:20:33] osmosis is a java software that queries planet-osm and gets diffs from a previously know state [13:20:33] akosiaris is doing such an amazing job explaining, i'm not sure i'm needed :D [13:20:47] *known [13:20:48] the state is recorded in the /srv/osmosis directory in a file [13:20:59] yurik: don't worry, I'll have questions for you as well at some point [13:21:04] oki ;) [13:21:07] so it's local [13:21:35] the diffs created by osmosis are then piped into osm2pgsql which uses them to update the database [13:22:04] update means all the CRUD operations for points, lines, ways etc [13:22:37] the initial state was loaded into the postgres master by max quite some time ago and since then osmosis just keeps it updated [13:23:08] the initial state was again a run of osm2pgsql, just with a few different arguments and a .pbf file fed to it [13:23:27] it ran for quite a while. IIRC on those boxes it was somewhere around 14-15 hours [13:23:32] (03PS1) 10Bartosz Dziewoński: Remove misleading comment in UploadWizard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287948 [13:23:37] on spinning disks it's closer to 2-3 days [13:23:49] (03CR) 10Hashar: [C: 031] "Cherry picked on CI puppet master. Confirmed to work on both Trusty and Jessie instances after I have uninstalled the redis package and dr" [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [13:23:49] (we've done that too in the past) [13:24:09] akosiaris, i would hate for it to be lost - we really ought to put it on https://wikitech.wikimedia.org/wiki/Maps [13:24:22] some of these things i didn't know myself [13:24:30] (the db import stuff) [13:24:41] rxy: fixed now? [13:24:58] yurik: an incarnation of it already is in puppet, the exact one max used, yes we should put it on wikitech [13:25:24] (03PS2) 10Hashar: contint: create /mnt/redis [puppet] - 10https://gerrit.wikimedia.org/r/287946 [13:26:04] gehel: there is also an import_waterlines cron that updates the postgres database. it fetches precalculated waterlines and imports them in the database.
another tool is used for that [13:26:11] Krinkle: Yes, It is fixed :) [13:26:17] it's called shp2pgsql [13:26:42] and it's part of the postgis debian package [13:27:03] akosiaris: I'm trying to take notes and update my diagrams while we speak, but they are less and less readable... [13:27:05] as the name suggests, it converts shapes to SQL [13:27:23] (03CR) 10Hashar: [C: 031] "Extracted the redis port to a $redis_port variable to make sure it is consistent between the redis::instance and the Service requirement o" [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [13:27:36] gehel: I am almost done, if that's any consolation [13:27:42] shp2pgsql just does the pgsql import? Or also gets the data from OSM? [13:28:00] it gets data from http://data.openstreetmapdata.com/water-polygons-split-3857.zip [13:28:20] it's a zip file, unzip, fed to shp2pgsql which pipes to psql [13:28:36] import_waterlines.erb is the file you want in puppet [13:29:03] the above is done in order to save us the trouble of calculating the waterlines, since somebody already does that regularly [13:30:03] gehel: and I think I am done.. nothing else comes to mind [13:30:06] questions ? [13:30:06] !log depooling and rebooting cp1065 to test mdadm boot workaround T131961 [13:30:07] T131961: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961 [13:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:41] do we re-update things like waterlines periodically? or is that included in the normal update stream once you've imported it once? [13:31:04] yes we do it periodically [13:31:06] gehel, could you go over the https://wikitech.wikimedia.org/wiki/Maps to see if it makes any sense, and change it? It might really help you because you will stumble on all the missing peaces, whereas to me or akosiaris it may look just fine [13:31:07] I figure they must shift around as more-accurate surveys are done, as weather events and sea level rise take effect, etc [13:32:02] bblack: once per month from what I see [13:32:08] the waterlines that is [13:32:15] the standard stream is updated every day [13:32:29] well that sentence sounds wrong [13:32:40] the standard stream of updates is a cron job running once per day [13:33:07] Ok, updated diagram : http://tinyurl.com/j4jjm6a [13:33:11] given the rate at which we re-render vector tiles it's enough up to now but we might want to take it down to the hour or minute level in the future [13:34:16] gehel: the frontend varnishes use consistent hashing to contact the backend varnishes, so any frontend may contact any backend [13:34:17] akosiaris: the load related to Osmosis is probably pretty much the same if we change the frequency (more updates, but smaller) [13:35:01] gehel: yup. a bit more load, but more evenly spread across the time of the day [13:35:24] gehel: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=maps-test2001.codfw.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Maps+Cluster+codfw [13:35:27] see the spike ? [13:35:36] it's osmosis [13:35:52] 06Operations, 10Traffic: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2280645 (10BBlack) [13:35:53] nice! [13:36:07] Back to the tiles generation, how do we know which tiles to generate? [13:36:59] yurik: I think you know the algorithm best, wanna take that one ? 
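
The two postgres update paths described in this stretch of the conversation, written out as a hedged sketch. The tool names (osmosis, osm2pgsql, shp2pgsql, psql) and the waterlines URL are the ones mentioned above, but the exact flags, file names and target database name are guesses; the production versions live in puppet (e.g. import_waterlines.erb) and will differ:

    import subprocess

    # Daily update (conceptually): osmosis reads replication diffs from
    # planet-osm starting at the state kept under /srv/osmosis, and the
    # change stream is applied to postgres with osm2pgsql in append mode.
    # Flags are illustrative only.
    subprocess.run(
        "osmosis --read-replication-interval workingDirectory=/srv/osmosis"
        " --write-xml-change - | osm2pgsql --append -",
        shell=True, check=True)

    # Monthly waterlines refresh (conceptually): fetch the precomputed water
    # polygons, convert the shapefile to SQL with shp2pgsql, pipe into psql.
    # Shapefile and database names are assumptions.
    subprocess.run(
        "curl -sO http://data.openstreetmapdata.com/water-polygons-split-3857.zip"
        " && unzip -o water-polygons-split-3857.zip"
        " && shp2pgsql water_polygons.shp | psql gis",
        shell=True, check=True)
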
[13:37:18] I only know the basics on that one, like avoiding water tiles since they are duplicates [13:37:21] gehel, that's a tricky one :) [13:37:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150563 keys - replication_delay is 0 [13:37:48] Other question, what is the content of the Redis queue? Just the coordinates of the tile to generate? Or the data extracted from postgresql? In other words, does Tilerator itself have to communicate with postgres? [13:38:10] gehel, redis just has a list of tiles (ranges) [13:38:26] tilerator runs long-running SQL and converts it into tiles [13:38:53] brb, coffee [13:38:54] Where does this list of tiles come from? Postgres? Plus some magic to remove uninteresting tiles? [13:39:15] akosiaris: take your time! I'll get some coffee too... [13:39:28] as for which tiles - i generate ALL tiles (including water) for z0-9, and for higher zooms i generate all tiles that exist in the previous level, and save each one only if it's not empty [13:39:34] gehel, ^ [13:40:17] (03CR) 10Faidon Liambotis: Pipe websockets through traffic layers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [13:40:32] this way if i try to generate a water or a blank land tile at z10, i don't save it, and when i do z11, i don't even try to generate the 4 tiles underneath it because the one above does not exist [13:40:32] ok, so there is no communication between TileratorUI and Postgres? [13:40:37] nope [13:40:41] gehel, well [13:40:53] not exactly, because i could do it for testing [13:41:00] gehel, want me to show it to you [13:41:02] ? [13:42:14] lemme get my coffee first... [13:43:07] yurik: could you have a look at http://tinyurl.com/jay6g4b and tell me if I got everything wrong? [13:43:55] * yurik looks [13:44:24] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2280671 (10elukey) @Cmjohnson: I tried to make a very simple partman recipe (https://phabricator.wikimedia.org/P3025) but I am sure that it is wrong on some many leve... [13:45:44] akosiaris, you might also want to see it :) [13:46:15] gehel, i would drop the tileratorui on slaves in the picture - it doesn't add value there [13:46:26] yes it may run there, but just as a backup [13:46:51] actually, feel free to remove it from the slaves altogether [13:47:22] gehel, i have 15 min before i have to go [13:47:32] and back in a few hours [13:48:50] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2280706 (10Dzahn) @Robh will handle, he already started with the RAID rebuild. [13:50:02] mutante: cool, thanks [13:50:15] mutante: btw, can you reenable puppet on ununpentium?
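
yurik's generation rule above, restated as a small sketch. The render/save/exists callables are placeholders, and the real logic lives in the Node.js tilerator; the z0-9 cutoff is the one quoted in the conversation:

    def generate_pyramid(max_zoom, render, save, exists):
        # z0-9: generate (and store) every tile, water included.
        # z10+: only try tiles whose parent was stored, and store only
        #       non-empty results, so empty water/land never cascades down.
        for z in range(max_zoom + 1):
            side = 2 ** z                      # 4**z tiles at this zoom level
            for x in range(side):
                for y in range(side):
                    if z > 9 and not exists(z - 1, x // 2, y // 2):
                        continue               # parent missing: skip all 4 children
                    tile = render(z, x, y)
                    if z <= 9 or tile is not None:
                        save(z, x, y, tile)
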
11 days w/ puppet disabled is a Bad Idea™ [13:50:27] mutante: if need be, remove the role from site.pp, so that it won't mess with your setup [13:51:02] 06Operations, 06Services, 03Mobile-Content-Service, 15User-mobrovac: Automatic monitoring checks for the MCS failing in production - https://phabricator.wikimedia.org/T134866#2280709 (10mobrovac) 05Open>03Resolved [13:52:15] (03PS1) 10Jcrespo: Add db1023 db weight; reduce weight on db1035 and db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287953 [13:52:44] paravoid: yes [13:52:57] thanks so much :) [13:55:38] !log restarting nginx on dataset1001, ms1001 and francium for openssl update [13:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:06] gehel: kartotherian is indeed an LVS service (answering your question there) [13:56:17] (03PS2) 10Jcrespo: Add db1023 db weight; reduce weight on db1035 and db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287953 [13:56:49] akosiaris: Thanks! That saves a bit of time... [13:57:03] (03CR) 10Jcrespo: [C: 032] Add db1023 db weight; reduce weight on db1035 and db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287953 (owner: 10Jcrespo) [13:57:28] gehel: tilerator gets data from postgres, does not save anything into postgres (that is change the arrow's direction) [13:57:38] And now, a not entirely related question... I have to setup the new Maps servers. We probably want to set them up without interfering with the current cluster and switch to it once we are ready. [13:57:42] and get's stuff from redis as well [13:58:13] there is yet no redis on maps-test2002-4 [13:58:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1023 db weight; reduce weight on db1035 and db1044 (duration: 00m 26s) [13:58:15] akosiaris: arrow are not data flow, but initiators of communication [13:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:23] ah, ok then [13:58:31] <_joe_> gehel: if you're going to do anything fancy with redis replica (as in: not a simple 1:1 master-slave setup) let me know and I can help [13:59:00] gehel, ready? I have about 10 min to show it [13:59:14] gehel: ok, the rest LGTM then [13:59:17] _joe_: I have no idea yet if that's needed or not. As I understand it, redis is not used in any client facing scenario, so HA is probably not an issue... [13:59:29] yurik: ok, how? [14:00:23] (03PS1) 10Dzahn: ununpentium: temp remove the RT role [puppet] - 10https://gerrit.wikimedia.org/r/287954 [14:00:30] akosiaris: the current cluster has quite a bit of hiera configuration in files related to the "maps" role. Do we have an easy way to duplicate that role? [14:00:32] gehel, ringing [14:00:48] yurik: yes, I have 2 phones ringing, but not my computer... [14:01:29] !log depooling and rebooting cp1066 to test mdadm boot workaround T131961 [14:01:30] T131961: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961 [14:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:39] 06Operations, 13Patch-For-Review: decom argon - https://phabricator.wikimedia.org/T134223#2280781 (10Dzahn) @Krinkle there was alread https://gerrit.wikimedia.org/r/#/c/287797/ which i was aware of and +1ed [14:04:26] 06Operations, 13Patch-For-Review: decom argon - https://phabricator.wikimedia.org/T134223#2280789 (10Dzahn) The mentions on DNS the mgmt interface for T134826 and as intended. 
[14:04:51] (03PS1) 10BBlack: cache_misc: add stream.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/287956 (https://phabricator.wikimedia.org/T134871) [14:05:57] (03PS2) 10Dzahn: ununpentium: temp remove the RT role [puppet] - 10https://gerrit.wikimedia.org/r/287954 [14:06:11] (03PS1) 10BBlack: cache_misc: turn on websocket support [puppet] - 10https://gerrit.wikimedia.org/r/287958 (https://phabricator.wikimedia.org/T134870) [14:06:13] (03CR) 10Dzahn: [C: 032] ununpentium: temp remove the RT role [puppet] - 10https://gerrit.wikimedia.org/r/287954 (owner: 10Dzahn) [14:07:25] (03CR) 10Krinkle: "Went ahead in a separate patch 7be1ed68374a because this was missing handling for CentralAuth (defined separately), which caused #central " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287797 (owner: 10Alex Monk) [14:08:45] !log restarting apache on uranium for openssl update [14:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:29] !log restarting apache on neon for openssl update [14:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:19] (03PS1) 10Bartosz Dziewoński: Prepare Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287959 (https://phabricator.wikimedia.org/T134775) [14:12:31] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2280817 (10BBlack) Just to stab randomly at things: I installed the packages `freeipmi` and `libipc-run-perl` on cp1008 (test/prod cache host) and downloaded the ch... [14:14:15] 06Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2280820 (10ema) >>! In T131928#2235125, @fgiunchedi wrote: > note: while rebooting machines with software raid it'll help understand if {T131961} is fixed I've tried rebooting cp1065 and cp1066 with https://... [14:14:35] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:05] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:14] !log restarting ntp on achernar for openssl update [14:15:19] mutante: Can you check that https://phabricator.wikimedia.org/P3001 is resolved now? I heard back from my hosting, should be fixed now. [14:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:35] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:54] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:58] there we go again [14:15:59] wth? [14:16:26] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:17:25] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:17:45] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:18:07] hm all i did was run service_checker manually on scb1001 [14:18:55] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:20:16] (03PS2) 10Andrew Bogott: Removing references to pnds database on s5 [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [14:20:34] mobrovac: with what arguments? 
[14:24:26] !log restarting ntp on hydrogen for openssl update [14:24:29] (03PS5) 10Bartosz Dziewoński: Configure testwiki as foreign file repo for test2wiki, allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) [14:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:50] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2280856 (10BBlack) A more-verbose run: ``` root@cp1008:~# ./check_ipmi_sensor -H localhost -vv IPMI Status: Critical [Presence = Critical ('Entity Absent'), Power S... [14:24:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6155232 keys - replication_delay is 614 [14:25:03] (03CR) 10Andrew Bogott: [C: 031] "I ran this through the puppet compiler and it looks safe from my end." [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [14:25:33] 06Operations, 03Discovery-Search-Sprint: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2280863 (10Dzahn) My 2 cents from the past experience: Icinga checks via graphite always sound good in theory but often end up having issues like this or similar ones. The setu... [14:27:43] !log upgrading cp3007 (misc) to varnish 4 [14:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:41] ema: welcome back to the V4 migration :P [14:28:51] (03CR) 10MarkTraceur: [C: 031] Undeploy UploadWizard from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 (owner: 10Bartosz Dziewoński) [14:29:19] elukey: cheers! [14:29:28] (03PS2) 10Bartosz Dziewoński: Undeploy UploadWizard from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 [14:30:54] (03CR) 10MarkTraceur: [C: 031] Configure testwiki as foreign file repo for test2wiki, allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [14:31:34] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2280887 (10Dzahn) >>! In T112765#2242776, @BBlack wrote: > ... or if phab's websocket stuff would be on a completely different public hostname and/or backend server... [14:31:44] <_joe_> mobrovac: time to fix those citoid specs :P [14:32:03] <_joe_> (I can do it if you point me to where in the code they are) [14:32:29] !log restarting ntp on chromium for openssl update [14:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:57] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2280892 (10Dzahn) a:05Dzahn>03RobH [14:33:28] 06Operations, 10ops-eqiad: Investigate cp1008 psu status - https://phabricator.wikimedia.org/T134888#2280893 (10Cmjohnson) [14:37:42] 06Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2280911 (10Dzahn) Meanwhile i'm interested in misc servers that we want to have in BOTH eqiad AND codfw ganeti clusters. [14:37:50] (03Abandoned) 10Elukey: Restore basic memcached settings to mc1009 as part of a performance test. 
[puppet] - 10https://gerrit.wikimedia.org/r/287237 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:37:52] !log restarting ntp on acamar for openssl update [14:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:08] bblack: c/p the nagios def for it [14:39:23] _joe_: i really think there's something wrong with the nrpe checks themselves [14:39:35] _joe_: running service_checker gives me all good [14:40:08] <_joe_> mobrovac: on citoid too? [14:40:32] <_joe_> mobrovac: s/nrpe checks/nrpe/ [14:40:41] yes _joe_ [14:40:45] <_joe_> but yeah I suspect as much too [14:40:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/1: down - Core: cr2-eqiad:xe-5/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave]BR [14:42:26] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2230318 (10jcrespo) [14:44:52] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2280927 (10Dzahn) a:03Dzahn [14:44:58] andrewbogott, thanks [14:45:01] !log restarting ntp on maerlant for openssl update [14:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:48] 06Operations, 07Graphite: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889#2280928 (10fgiunchedi) [14:46:04] jynus: I'm much happier having that switch-over finished :) [14:46:29] I've made a last dump of pdns on m5-master [14:46:33] before dropping it [14:47:24] andrewbogott, I think in the end, the migration was necessary but not the cause, right, or am I wrong? [14:47:42] (03CR) 10Jcrespo: [C: 032] Removing references to pnds database on s5 [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [14:47:49] (03PS3) 10Jcrespo: Removing references to pnds database on s5 [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) [14:48:04] jynus: It's unclear. Right now we're operating under the theory that the problem is 'general underperformance' rather than a specific race condition. [14:48:17] So, the db change will improve performance, but it wasn't exactly low-hanging-fruit in that regard. [14:48:34] at least it helped a bit debugging [14:48:38] I suppose [14:49:02] yep, it eliminated complications. And now the setup is something that the designate upstream devs regard as normal rather than weird [14:49:21] 06Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2280954 (10Dzahn) For example T134507 to create planet2001 so we have 1001 and 2001 for failover. Do we want that? And for all misc services that are VMs? We could e... 
[14:50:24] (03CR) 10Jcrespo: [V: 032] Removing references to pnds database on s5 [puppet] - 10https://gerrit.wikimedia.org/r/287922 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [14:52:30] !log restarting pybal on lvs3004 [14:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:56] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2280974 (10Dzahn) 07:51 !irc.wikimedia.org *** Processing connection to irc.wikimedia.org 07:51 !irc.wikimedia.org *** Looking up your hostname... 07:06 -!- Welcome to the... [14:53:10] (03CR) 10Thcipriani: [C: 031] "Puppet compiler output: http://puppet-compiler.wmflabs.org/2686/" [puppet] - 10https://gerrit.wikimedia.org/r/287112 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [14:53:23] !log restarting ntp on nescio for openssl update [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:51] !log restarting pybal on lvs3002 [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:24] 06Operations: check status of multiple systemd units - https://phabricator.wikimedia.org/T134890#2280987 (10fgiunchedi) [14:56:12] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2281001 (10mobrovac) Given the request, I think it'd be wise to set up //something new//. My request is not tied to a particular service role, but to the hosts themselves. [14:56:22] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2281002 (10Dzahn) >>! In T133328#2261417, @Danny_B wrote: > @Dzahn Thanks. Pls note here, after the service is restarted somewhen in future, so we can check... That happen... [14:56:55] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2281003 (10BBlack) Checked with @Cmjohnson and the failures reported above about the power supply are real. So +1 for the ipmi checker :) > 14:38 < cmjohnson1> bb... [14:56:59] !log dropped pdns db and associated user accounts on m5-master [14:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:33] 06Operations, 03Discovery-Search-Sprint: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2281004 (10EBernhardson) The current alert is also against prefix search which has very low volume now. We should probably switch to comp suggest or the aggregate all queries me... [14:57:34] (03PS1) 10Andrew Bogott: bootstrapvz: fix regexp for ldap server in firstboot [puppet] - 10https://gerrit.wikimedia.org/r/287966 [15:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160510T1500). Please do the needful. [15:00:05] Krenair jan_drewniak MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:16] I'm having timeout issues getting tokens on dashbaord.wikiedu.org. Any known problems going on on the mediawiki side? [15:00:39] hi. both my patches should be no-ops. [15:01:00] Hello. I just added a patch for SWAT. 
[15:01:06] 06Operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 3 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#2281011 (10hashar) Impressive. The nice thing is that it is a single check to add on all a server, so that limit the load burden on the Icinga server and the targe... [15:01:09] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2281012 (10Danny_B) 05Open>03Resolved a:03Danny_B >>! In T133328#2281002, @Dzahn wrote: >>>! In T133328#2261417, @Danny_B wrote: >> @Dzahn Thanks. Pls note here, afte... [15:01:28] I can SWAT today. [15:01:49] jan_drewniak: Krenair ping me when you're around for SWAT. [15:02:08] (03PS2) 10Thcipriani: Remove misleading comment in UploadWizard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287948 (owner: 10Bartosz Dziewoński) [15:02:08] thcipriani: o/ [15:02:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287948 (owner: 10Bartosz Dziewoński) [15:02:28] jan_drewniak: hello [15:03:04] Oh, I didn't realize we switched to []-syntax in mediawiki-config. Sorry, will amend. [15:03:06] (03Merged) 10jenkins-bot: Remove misleading comment in UploadWizard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287948 (owner: 10Bartosz Dziewoński) [15:03:18] ragesoss: yeah, it doesn't seem like the logins work there. but it doesn't look like a problem on the Wikipedia side [15:03:51] MatmaRex: thanks. I'm getting timeouts with token request. I'll investigate further. [15:04:03] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2281016 (10fgiunchedi) yeah I think it would, we should test ircd restarts while ircecho is running [15:04:08] ragesoss: i just tried looking at https://dashboard.wikiedu.org/users/auth/mediawiki and it displays a 500 error [15:04:18] and takes exactly 30 seconds to load, which looks like some timeout somewhere [15:04:30] yeah. I have related stack traces. [15:04:31] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2281017 (10Danny_B) a:05Danny_B>03None [15:06:07] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Remove misleading comment in UploadWizard configuration [[gerrit:287948]] (duration: 00m 47s) [15:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:43] (03PS2) 10Glaisher: Configure $wgCheckUserCAMultiLock for CentralAuth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286927 (https://phabricator.wikimedia.org/T128605) [15:07:01] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: fix regexp for ldap server in firstboot [puppet] - 10https://gerrit.wikimedia.org/r/287966 (owner: 10Andrew Bogott) [15:07:06] (03PS3) 10Glaisher: Configure $wgCheckUserCAMultiLock for CentralAuth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286927 (https://phabricator.wikimedia.org/T128605) [15:07:11] 06Operations: check status of multiple systemd units - https://phabricator.wikimedia.org/T134890#2281018 (10fgiunchedi) also I think the current check could be simplied and decoupled checking 'active' status from the last successful run of a periodic unit, before discovering `check_systemd_unit_state` I came up... [15:07:16] (03PS2) 10Thcipriani: T132520 bumping portals to master. 
Disabling A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287945 (https://phabricator.wikimedia.org/T132520) (owner: 10Jdrewniak) [15:07:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287945 (https://phabricator.wikimedia.org/T132520) (owner: 10Jdrewniak) [15:07:59] !log restarted apache on californium for openssl update [15:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:08] (03Merged) 10jenkins-bot: T132520 bumping portals to master. Disabling A/B test on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287945 (https://phabricator.wikimedia.org/T132520) (owner: 10Jdrewniak) [15:09:05] (03PS3) 10Filippo Giunchedi: graphite: add cluster_servers graphite-web setting [puppet] - 10https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451) [15:09:13] !log running sync-portals for portals SWAT [15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:21] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 27s) [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:37] 06Operations, 10Pybal, 10Traffic: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2281050 (10ema) [15:10:54] !log thcipriani@tin Synchronized portals: (no message) (duration: 00m 33s) [15:11:00] ^ jan_drewniak check please [15:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:27] thcipriani: looks good, thanks! [15:11:34] jan_drewniak: thank you for checking. [15:12:53] (03PS4) 10Filippo Giunchedi: graphite: add cluster_servers graphite-web setting [puppet] - 10https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451) [15:13:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "compiler: https://puppet-compiler.wmflabs.org/2729/" [puppet] - 10https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [15:13:41] MatmaRex: hmm. may not be a way to do https://gerrit.wikimedia.org/r/#/c/287939/ without spiking the error logs (rsync is somewhat non-atomic). I'll just do a sync-dir. sound good to you? [15:15:09] thcipriani: hmm, bah. i can split that in two patches if you want [15:15:16] or just not do it. it seemed like trivial cleanup ;) [15:15:27] !log graphite-web reload on graphite1001 after merging https://gerrit.wikimedia.org/r/281631 [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:35] !log restarting apache on silver (wikitech host) for openssl update [15:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:44] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2281071 (10RobH) a:05RobH>03Papaul The new disk isn't showing in the software. @Papaul: Is the new disk showing a green LED for power, and if not can it be reseated? Please advise, and don't close this task u... 
[15:18:23] (03PS2) 10Bartosz Dziewoński: Simplify $wgApiFrameOptions configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287939 [15:18:24] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281073 (10TheDJ) [15:18:25] (03PS1) 10Bartosz Dziewoński: Simplify $wgApiFrameOptions configuration (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287968 [15:18:25] MatmaRex: if you could split it, that would be ideal. I can't imagine that doing a sync-dir would create too large a blip in the logs, but ideally that could be avoided. It seems like an innocuous change, sorry things aren't quite atomic just yet :\ [15:18:27] thcipriani: ^ that should be painless [15:18:31] thanks. [15:18:33] (03PS6) 10Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) [15:19:40] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287939 (owner: 10Bartosz Dziewoński) [15:20:28] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2281080 (10Papaul) @RobH the disk is insert and showing green light [15:20:36] (03Merged) 10jenkins-bot: Simplify $wgApiFrameOptions configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287939 (owner: 10Bartosz Dziewoński) [15:21:07] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2281081 (10RobH) a:05Papaul>03RobH Ok, I'll try rebooting and looking at it in bios. [15:21:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287968 (owner: 10Bartosz Dziewoński) [15:21:49] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281083 (10BBlack) Can we get some more debugging details from the browser? Is there some way you can ask it for more detail about the nature of the failure when it happens? There was a similar report from... [15:22:00] (03Merged) 10jenkins-bot: Simplify $wgApiFrameOptions configuration (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287968 (owner: 10Bartosz Dziewoński) [15:22:38] !log sinistra powercycling and troubleshooting starting for disk issue [15:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Simplify $wgApiFrameOptions configuration (1/2) [[gerrit:287939]] (duration: 00m 26s) [15:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:54] Krenair: Got thesame message now again :/ [15:23:57] the same* [15:24:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150882 keys - replication_delay is 0 [15:24:19] (03PS1) 10Filippo Giunchedi: graphite: add cluster_users for codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/287970 (https://phabricator.wikimedia.org/T134889) [15:25:37] I can't do anything with the Wikipedia API, even just queries, from the dashboard.wikiedu.org server. It times out every time. Is it possible that the server is being blacklisted by mediawiki? 
[15:25:43] MatmaRex: ^ [15:26:46] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Simplify $wgApiFrameOptions configuration (2/2) PART I [[gerrit:287968]] (duration: 00m 26s) [15:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:58] ragesoss: i have no idea (i'm not ops :) ). it's unlikely but theoretically possible, i guess [15:27:18] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Simplify $wgApiFrameOptions configuration (2/2) PART II [[gerrit:287968]] (duration: 00m 27s) [15:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:37] ^ MatmaRex thanks for splitting those up. There will come a day when that's a simple as it should be :) [15:27:39] hi thcipriani [15:27:43] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281144 (10chasemp) [15:27:45] Krenair: hello [15:27:50] thcipriani: :) everything seems still fine after the changes. thanks [15:28:08] meh, even the reboot of sinistra sees sdd immediately [15:29:16] Now "The database has been automaticly locked while the slave database catch up to the master." [15:29:44] (03PS2) 10Filippo Giunchedi: graphite: add cluster_servers for codfw/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/287970 (https://phabricator.wikimedia.org/T134889) [15:29:48] (03PS4) 10Thcipriani: Configure $wgCheckUserCAMultiLock for CentralAuth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286927 (https://phabricator.wikimedia.org/T128605) (owner: 10Glaisher) [15:30:14] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281144 (10ArielGlenn) The proposed size of the filesystem for dumps looks fine to me. [15:30:28] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281144 (10yuvipanda) +1 <3. Will we move the content over? There were no guarantees of such, but it might be a nice gesture. No need for it to be complete or consistent. [15:30:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286927 (https://phabricator.wikimedia.org/T128605) (owner: 10Glaisher) [15:31:01] RECOVERY - RAID on sinistra is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:31:04] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281169 (10yuvipanda) Should also remember to soft mount these. [15:31:18] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281170 (10Danny_B) I can confirm that most of the cases I remember happened when POSTing (save/preview page, save preferences, filtering on special pages...), can't guarantee it was //only// POST though...... 
[15:31:19] (03PS1) 10Ema: Upgrade cp3007 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/287971 (https://phabricator.wikimedia.org/T131501) [15:31:36] (03Merged) 10jenkins-bot: Configure $wgCheckUserCAMultiLock for CentralAuth wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286927 (https://phabricator.wikimedia.org/T128605) (owner: 10Glaisher) [15:32:06] (03CR) 10Filippo Giunchedi: "compiler https://puppet-compiler.wmflabs.org/2732/" [puppet] - 10https://gerrit.wikimedia.org/r/287970 (https://phabricator.wikimedia.org/T134889) (owner: 10Filippo Giunchedi) [15:32:23] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281176 (10chasemp) I have really no opinion on moving the data over, other than it costs us in maint time obv. We don't have a real strategy on /scratch cleanup so it's all adhoc reasoning. [15:32:29] Krenair: could you rebase your patch. Gerrit can't evidently. [15:32:48] er s/patch./patch? [15:33:37] (03CR) 10BBlack: [C: 031] Upgrade cp3007 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/287971 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [15:34:18] (03CR) 10Ema: [C: 032 V: 032] Upgrade cp3007 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/287971 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [15:35:01] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281199 (10yuvipanda) yeah. I'd like for us to just do a simple rsync if possible. If we decide to not do that, we should provide people notice as well. I know that the kiwix project for example is using it for... [15:35:43] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Configure $wgCheckUserCAMultiLock for CentralAuth wikis [[gerrit:286927]] (duration: 00m 26s) [15:35:46] Glaisher: ^ check please [15:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:04] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Let's keep this simpler." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [15:36:55] thcipriani: can't test because the extension code hasn't reached the wikis yet and I also don't have access to the page [15:37:06] thcipriani, actually, it seems my patch is now obsolete [15:37:08] but nothing seems to be broken on logs etc.? [15:37:09] Krinkle did it earlier [15:37:24] Glaisher: logs look good. [15:37:52] Krenair: oh, ok. [15:37:52] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281202 (10chasemp) The main thing would be I would want to snapshot the volume, then copy over and then swap out to the new volume as gracefully as possible. Which in some cases is not graceful at all. That me... [15:37:55] (03Abandoned) 10Alex Monk: Cleanup IRC switchover from argon to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287797 (owner: 10Alex Monk) [15:38:04] thcipriani: ok, let's hope it didn't/won't break anything then. Thanks. :) [15:38:29] hrmm, the last time i had to manually restore a software raid, it wasnt on huge gpt disks. it seems like using dd to copy the partition layout isnt ideal.. anyone know of a standard tool that would ideally copy all my partitioning of sdc to sdd (but dont need data) [15:38:31] ? [15:38:41] Glaisher: oh good. 
yes, let's :) [15:39:20] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281206 (10yuvipanda) I think that's good enough if we pre-announce it early enough. [15:39:39] 06Operations, 06Labs: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2281207 (10chasemp) This will also involve a period of dumps being offline. I think this is mostly a small issue though, anecdotally I don't see too many consumers this morning. But I can't do the shuffle onli... [15:41:22] (03CR) 10Jcrespo: "I agree with both the goals and the implementation, but in an ideal world, I would like to block this by T114752 and T134480. Realisticall" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [15:42:03] robh: gdisk should be able to copy the gpt partition table [15:43:56] godog: gdisk doesnt seem to be stock in our jessie install =[ [15:44:01] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281216 (10Samtar) Just to tag on, also Europe (UK) and also on POST events - it's very periodic (twice today through a couple of hours editing) [15:44:12] i suppose these are the things we need to start adding to standard over time? [15:44:36] bearND: mdholloway I'm around for scap deploy things in puppet swat today, FYI. Once everything merges, puppet runs on tin, then targets, you should just be able to run `deploy` inside of /srv/deployment/mobileapps/deploy. To watch the output from all targets you can run `deploy-log` inside the same directory in another term window. [15:44:39] everything online says sfdisk which is also not currently installed for us [15:45:18] robh: yeah I guess it is a one off, apt install gdisk and then sgdisk, sfdisk is util-linux though so installed iirc [15:45:33] thcipriani: cool, thanks! [15:46:46] (03PS1) 10DCausse: Bump CirrusSearchRequestSet avro schema to rev 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287973 (https://phabricator.wikimedia.org/T133726) [15:47:46] godog: ahh, yeah, it is [15:48:45] !log collect mysqld metrics with prometheus-metrics-collector 0.8.1 on db2070 for 24h T128185 [15:48:45] T128185: Prepare mysql account and options for prometheus - https://phabricator.wikimedia.org/T128185 [15:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:37] godog: moritzm: _joe_ : I have added a couple trivial patches for puppet swat, but can't attend :/ Feel free to skip them [15:49:48] <_joe_> oh puppetswat is upcoming yes [15:49:51] <_joe_> I'll take a look [15:50:14] mdholloway: bearND it just occurred to me we have to bump the puppet patch again :( We just updated the scap version in the apt repositories, but we haven't rolled it out yet. We have to roll that out first since it's pinned to a specific version in puppet. [15:50:18] ok! thanks hashar _joe_ [15:51:00] <_joe_> mobrovac: I see https://gerrit.wikimedia.org/r/#/c/287112/ is up for puppet swat, but I don't think it's suitable for it [15:51:15] <_joe_> bearND too ^^ [15:51:34] <_joe_> we can schedule a time for deploying it tomorrow maybe, at a time that's comfortable for everyone? 
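A minimal sketch of the scap3 service-deploy flow thcipriani describes above. The two commands and the repo path come straight from his message; the path is specific to mobileapps, and other services would use their own /srv/deployment/<service>/deploy checkout.

    # on the deployment host, inside the service's deploy repository
    cd /srv/deployment/mobileapps/deploy
    deploy        # scap3 pushes the checked-out revision to all configured targets
    # in a second terminal window, from the same directory, to follow per-target output
    deploy-log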
[15:52:07] !log deployed patch for T134863 [15:52:07] that's a continuation of work with godog, but i concur that switching to scap3 isn't trivial [15:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:38] mobrovac: _joe_ I don't think we can do that one until https://gerrit.wikimedia.org/r/#/c/287918/ goes anyway. [15:53:04] we can merge the scap change now though [15:53:04] since there can be only 1 version in the apt repo at any time. [15:53:48] thcipriani: how does scap's deb version figure into switching mobileapps to scap? [15:53:58] bleh, even though the disk is hot swap [15:54:06] the sgdisk partition update wont be seen by kernel until reboot [15:54:10] so much for hot swap =] [15:54:19] mobrovac: because it'll try to install 3.1.0-1 on scap targets and it won't find it in apt. [15:54:50] thcipriani: i think it'll just check that the appropriate version is present on the target, which currently it is [15:55:17] ah, I didn't realize all the targets already had scap installed. If that's the case then, yeah, should be fine. [15:55:18] robh: it should [15:55:47] if it doesn't, try hdparm -z (IIRC) [15:55:55] madhuvishy: What can you tell me about the labs instance 'wikimedia-ui' in the 'design' project? Do you know if I can delete it or if it is still in use? [15:56:16] root@sinistra:~# sgdisk -R /dev/sdc /dev/sdd [15:56:17] Creating new GPT entries. [15:56:17] Warning: The kernel is still using the old partition table. [15:56:18] The new table will be used at the next reboot. [15:56:28] and i couldnt see, so i totally rebooted [15:57:39] oh, and i randomized the GUID of the second disk since it would otherwise be identical to sdc [15:58:13] (03PS1) 10Faidon Liambotis: base: add gdisk to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/287975 [15:58:15] and now its in grub recovery... fml [15:58:45] this gets about 30 more minutes of my frustration before i just reinstall it. (it was a new deployment host and had not yet been used, im just fighting with it now to better understand stuff) [15:59:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6153725 keys - replication_delay is 633 [15:59:15] as we should be able to add in disks easily. [15:59:32] well fuck, my guid command must have fucked the guid on the boot disk [15:59:36] cuz it doesnt find it and now wont boot. [15:59:47] * robh wasnt quite as careful as he would have been on a host with data! [15:59:48] (03CR) 10Elukey: Add the possibility to specify memcached's chunk growth factor. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [15:59:59] though my command should have only hit sdd. [16:00:04] godog moritzm _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160510T1600). [16:00:04] Krenair bearND hashar Glaisher: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:23] Hi. 
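For reference on the disk-cloning attempt above: the usual sgdisk sequence for copying sdc's GPT layout onto a fresh sdd looks roughly like the lines below. With -R the target disk is given first and the source disk last, which is easy to get backwards, and -G should be pointed only at the new disk. Device names are the ones from this incident; treat this as a sketch, not a record of what was actually run.

    sgdisk -R /dev/sdd /dev/sdc   # replicate sdc's partition table onto sdd (target first, source last)
    sgdisk -G /dev/sdd            # randomize the new disk's disk/partition GUIDs so they don't collide with sdc
    partprobe /dev/sdd            # or 'hdparm -z /dev/sdd' as suggested above, to have the kernel re-read the table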
[16:00:45] <_joe_> hi Glaisher [16:01:00] <_joe_> you're last in the queue, but if others don't show up now, you'll move up [16:01:15] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281288 (10BBlack) Without a more-detailed error message or some kind of trace of the connection attempt, it's difficult to get to the bottom of this. There are a thousand reasons a secure connection can f... [16:01:17] <_joe_> anyone else around? Krenair hashar ? [16:01:29] yes [16:02:12] <_joe_> ok [16:02:21] <_joe_> Krenair: 1 sec I have one comment for Glashier [16:02:22] (03PS1) 10Hashar: ocg: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287976 (https://phabricator.wikimedia.org/T134808) [16:03:23] (03CR) 10Giuseppe Lavagetto: "Please see my comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731) (owner: 10Glaisher) [16:03:33] !log i fubar'd sinistra's grub, it'll be offline for a bit while longer. [16:03:37] (03PS16) 10Giuseppe Lavagetto: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:51] <_joe_> Krenair: so talking about ^^ [16:04:11] (03CR) 10Hashar: [C: 031] "Cherry picked on deployment-puppetmaster. That fix deployment-pdf01 and deployment-pdf02." [puppet] - 10https://gerrit.wikimedia.org/r/287976 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [16:04:11] <_joe_> I guess you already merged a diamond collector that basically collects nrpe checks, right? [16:04:19] !log nodetool cleanup on restbase2005 T132976 [16:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:33] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [16:04:58] <_joe_> It seems correct AFAICT, and you seem to have taken into account the comments on the preceding PSs [16:05:02] 07Blocked-on-Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2281327 (10hashar) Summary of patches for puppet.g... [16:05:03] _joe_: https://gerrit.wikimedia.org/r/#/c/287121/ [16:05:17] i merged that, it's the collector he wrote [16:05:27] but wasnt used yet [16:05:59] (03CR) 10Giuseppe Lavagetto: [C: 032] "SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:06:02] (03CR) 10Glaisher: Add TranslationsUpdateJob to translate job runner group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731) (owner: 10Glaisher) [16:06:22] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2281335 (10RobH) So I had to install gdisk tools, as I needed sgdisk to copy the GPT partitions. Then after the clone, I attempted to randomize the GUID of the NEW disk, but somehow did it for ALL the disks and mess... [16:06:48] <_joe_> Krenair: merged, now I can't verify it in prod ofc [16:07:08] hi [16:07:43] _joe_: have to escape. 
Again feel free to skip my puppet patches :} [16:07:53] (03PS4) 10Andrew Bogott: Remove dns entries for the old ldap/dns servers [dns] - 10https://gerrit.wikimedia.org/r/287238 (https://phabricator.wikimedia.org/T126758) [16:08:25] (03PS6) 10Giuseppe Lavagetto: Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) (owner: 10Alex Monk) [16:08:42] <_joe_> hashar: let's see, if they're simple enough, I'll merge them [16:08:55] <_joe_> Krenair: this one LGTM as well, already tested in beta? [16:09:11] it's already in beta [16:09:22] <_joe_> ok cool [16:09:32] (03CR) 10Andrew Bogott: [C: 032] Remove dns entries for the old ldap/dns servers [dns] - 10https://gerrit.wikimedia.org/r/287238 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [16:09:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) (owner: 10Alex Monk) [16:09:57] keyholder patch works: http://shinken.wmflabs.org/service/deployment-tin/Keyholder%20status [16:10:03] <_joe_> cool :) [16:10:14] <_joe_> Krenair: so let's see the last one [16:10:36] This one hasn't been tested [16:10:42] (03PS1) 10RobH: disabling ulsfo for onsite work on 2016-05-11 [dns] - 10https://gerrit.wikimedia.org/r/287985 [16:11:18] <_joe_> I am not sure why are you re-reverting a revert by jynus [16:11:40] Look at the difference between PS1 and PS2 [16:11:43] bblack: ^ yeah that patch to push traffic from ulsfo was hella easy. i dont plan on pushing until this evening though [16:11:50] i flagged you as a reviewer [16:11:55] <_joe_> I mean why it was reverted in the first place? [16:12:24] * robh realizes its hella easy because others took the time to put in the logic, when he did this last time himself it was a lot of changes/replaces [16:12:26] I don't think the reason is in gerrit. it might be in the logs for this channel? [16:13:00] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281369 (10BBlack) As a random experiment, perhaps some of those reporting could try this in FF 46.0.1? 1. Type 'about:config' in the URL bar (it will probably pop up a warning about voiding your warranty... [16:13:27] <_joe_> Krenair: why is adding an explicit "monthday => '*'" going to change anything? [16:13:29] (03PS1) 10Dzahn: planet: remove MuddyB's blog because 504 Gateway Time-Out [puppet] - 10https://gerrit.wikimedia.org/r/287986 (https://phabricator.wikimedia.org/T133577) [16:13:58] thcipriani: so, is there anything I need to change for https://gerrit.wikimedia.org/r/#/c/287112/? 
[16:14:10] <_joe_> oh I see now [16:14:11] <_joe_> sorry [16:14:17] _joe_, his original patch didn't owrk [16:14:23] 06Operations, 06Discovery, 10Maps: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2281373 (10Gehel) [16:14:30] I had to revert and modify cron manually [16:14:34] <_joe_> ok ok [16:14:37] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2281391 (10Gehel) [16:14:38] <_joe_> makes sense now [16:14:39] 06Operations, 06Discovery, 10Maps: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2281390 (10Gehel) [16:14:42] <_joe_> thanks [16:14:48] (03PS4) 10Giuseppe Lavagetto: Revert "Revert "phabricator: Send weekly mail every week instead of on certain monthdays"" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [16:14:58] just make sure it works now [16:15:00] i think there was no minute defined and it was unclear if the puppet cron provider has a default that is "0" or "*" [16:15:05] <_joe_> jynus: I will [16:15:09] or it may have changed at some point [16:15:10] bearND: _joe_ wanted to hold off on that one, to reschedule for _not_ puppet swat since it's somewhat non-trivial. [16:15:18] so that all of a sudden there was one cron every minute and before it wasnt [16:15:18] as in, just check the final cron [16:16:13] yes, I cannot remember the details, it wasn't very clear [16:16:35] but I was sure that at some point the cron entry wasn't correct [16:17:06] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [16:17:15] yes, it was strange as in "puppet code existed like that before but now different results" and the minute => was missing [16:17:20] (03CR) 10Giuseppe Lavagetto: [V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/274788 (owner: 10Alex Monk) [16:17:27] if that's the same one, heh [16:18:51] thcipriani: _joe_ : does that mean I should re-schedule it for tomorrow mornings regular SWAT deploy? [16:19:00] (03PS2) 10Dzahn: planet: remove MuddyB's blog because 504 Gateway Time-Out [puppet] - 10https://gerrit.wikimedia.org/r/287986 (https://phabricator.wikimedia.org/T133577) [16:19:08] (03CR) 10Dzahn: [C: 032] planet: remove MuddyB's blog because 504 Gateway Time-Out [puppet] - 10https://gerrit.wikimedia.org/r/287986 (https://phabricator.wikimedia.org/T133577) (owner: 10Dzahn) [16:20:04] <_joe_> 0 0 * * 1 /usr/local/bin/project_changes.sh [16:20:13] <_joe_> ok seems fine [16:20:16] <_joe_> mutante: ... 
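Background for the cron discussion above: Puppet's cron resource fills any unspecified time field with '*' in the generated crontab, so a weekly job defined without an explicit minute turns into an every-minute job during its hour (the log hints the behaviour may have looked different at some point, hence the confusion). Using the job from this patch, the difference is roughly:

    # with 'minute' left unset, the unspecified field defaults to '*' and the entry becomes:
    * 0 * * 1  /usr/local/bin/project_changes.sh     # every minute from 00:00 to 00:59 on Mondays
    # with minute => 0 (and an explicit monthday => '*'), matching what _joe_ verified above:
    0 0 * * 1  /usr/local/bin/project_changes.sh     # once, at 00:00 on Mondays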
[16:20:30] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281416 (10BBlack) Note the above would test the hypothesis that we're hitting this: https://www.ruby-forum.com/topic/6878264 [16:20:56] robh: ok [16:20:58] _joe_: :) yea [16:21:27] (03CR) 10BBlack: [C: 031] disabling ulsfo for onsite work on 2016-05-11 [dns] - 10https://gerrit.wikimedia.org/r/287985 (owner: 10RobH) [16:22:14] Something strange going on under check_graphite in that cherry-pick counting commit: [16:22:17] krenair@shinken-01:~$ /usr/lib/nagios/plugins/check_graphite -U http://labmon1001.eqiad.wmnet -T 10 check_threshold 'deployment-prep.deployment-puppetmaster.CherryPickCounterCollector.cherrypicked_commits.ops-puppet' -W 0 -C 0 --from 48h --perc 100 --over [16:22:17] OK: Less than 100.00% above the threshold [0.0] [16:22:23] krenair@shinken-01:~$ /usr/lib/nagios/plugins/check_graphite -U http://labmon1001.eqiad.wmnet -T 10 check_threshold 'deployment-prep.deployment-puppetmaster.CherryPickCounterCollector.cherrypicked_commits.ops-puppet' -W 0 -C 0 --from 48h --perc 99 --over [16:22:23] CRITICAL: 99.93% of data above the critical threshold [0.0] [16:22:43] (03CR) 10Giuseppe Lavagetto: "A little comment, but it's good as it is." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [16:22:58] bearND: nah. _joe_ suggested scheduling it at a different time tomorrow. I'm not sure of his availability tomorrow... [16:23:00] it says OK for deployment-puppetmaster when it should fail the check :/ [16:23:03] maybe it'll sort itself out [16:23:09] <_joe_> bearND: which TZ are you in? [16:23:18] (03PS3) 10Giuseppe Lavagetto: contint: create /mnt/redis [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [16:23:37] (03CR) 10Elukey: Add the possibility to specify memcached's chunk growth factor. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [16:23:47] _joe_: Mountain Daylight Time, one hour before SF [16:23:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: create /mnt/redis [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [16:24:22] PROBLEM - DPKG on sinistra is CRITICAL: Connection refused by host [16:24:42] PROBLEM - Disk space on sinistra is CRITICAL: Connection refused by host [16:25:11] PROBLEM - RAID on sinistra is CRITICAL: Connection refused by host [16:25:22] (03CR) 10Giuseppe Lavagetto: Add TranslationsUpdateJob to translate job runner group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731) (owner: 10Glaisher) [16:25:28] (03PS2) 10Giuseppe Lavagetto: Add TranslationsUpdateJob to translate job runner group [puppet] - 10https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731) (owner: 10Glaisher) [16:25:33] PROBLEM - configured eth on sinistra is CRITICAL: Connection refused by host [16:25:55] <_joe_> bearND: so 15:00Z would be ok? [16:26:02] PROBLEM - dhclient process on sinistra is CRITICAL: Connection refused by host [16:26:03] PROBLEM - Check size of conntrack table on sinistra is CRITICAL: Connection refused by host [16:26:04] <_joe_> the time we have the morning swat at [16:26:12] PROBLEM - salt-minion processes on sinistra is CRITICAL: Connection refused by host [16:26:13] PROBLEM - puppet last run on sinistra is CRITICAL: Connection refused by host [16:26:13] <_joe_> mobrovac: would work for you? 
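On the two check_graphite runs pasted above: --perc appears to set what fraction of the datapoints in the window must sit above the -W/-C threshold before the check changes state, so --perc 100 only goes critical when literally every sample is above 0, which would explain the OK result despite 99.93% of the data being over the threshold. The URL, metric name and flags below are copied from the log; the interpretation of --perc is inferred from the output shown, not from the plugin's documentation.

    /usr/lib/nagios/plugins/check_graphite -U http://labmon1001.eqiad.wmnet -T 10 check_threshold \
      'deployment-prep.deployment-puppetmaster.CherryPickCounterCollector.cherrypicked_commits.ops-puppet' \
      -W 0 -C 0 --from 48h --perc 99 --over
    # goes CRITICAL once more than 99% of the last 48h of samples are above 0 cherry-picked commits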
[16:26:39] _joe_: that's ok for me [16:26:51] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2281445 (10Gehel) @Papaul: I'm following the service implementation on T134901. Can I close this task? Or do you still need to track something here? [16:26:53] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2281447 (10Dzahn) yep, i'll test it in labs. (we know setup works there now. i have added fake private data in labs/private, Krenair has tested it and fixed more. so that should not be hard now :) [16:27:27] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2281373 (10Gehel) [16:27:40] _joe_: bearND: 15:30Z would suit me much better [16:27:47] <_joe_> mobrovac: cool [16:28:05] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2281457 (10Dzahn) (but don't restart ircd in prod, users hate it :) [16:28:14] _joe_: mobrovac : works for me [16:28:40] (03CR) 10Giuseppe Lavagetto: [C: 032] Add TranslationsUpdateJob to translate job runner group [puppet] - 10https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731) (owner: 10Glaisher) [16:29:40] <_joe_> Glaisher: merged, testing it on one jobrunner just to be sure [16:31:00] <_joe_> Glaisher: in about 30 mins it will be deployed everywhere [16:31:50] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2281484 (10Papaul) @Gehel you are welcome to close the task, [16:31:56] _joe_: Alright. Thanks! [16:34:12] _joe_: puppetswat done? [16:34:24] can i throw in https://gerrit.wikimedia.org/r/#/c/287902/ perhaps? [16:35:51] (03CR) 10Mobrovac: "We have scheduled to switch MobileApps for tomorrow at 15:30 UTC. I think it makes sense push both CXServer and MobileApps at the same tim" [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [16:36:06] <_joe_> mobrovac: it is done indeed, let's see [16:37:01] (03PS3) 10Giuseppe Lavagetto: service::node: Add a convenience script to pretty-tail logs [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [16:37:10] <_joe_> mobrovac: yeah I already reviewed it this morning :) [16:37:14] <_joe_> it's cool [16:37:23] :) [16:38:34] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [16:38:53] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281529 (10Dzahn) list of IPs that still show up now.. and the names they resolve t: | 10.68.17.70 | integration-slave-precise-1011.integration.eqiad.wmflabs. | 10.68.16.5... [16:39:22] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/287902 (owner: 10Mobrovac) [16:40:48] !log repooling cp3007 running varnish 4 [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:46] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10AlexMonk-WMF) Ran `root@rt1:~# dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia` on rt1.servermon.eqiad.wmflabs (10.68.19.15) Puppet is broken on t... 
[16:43:01] <_joe_> ema: \o/ [16:43:37] <_joe_> mobrovac: works like a charm on scb1001 [16:43:37] yay! [16:43:48] (03CR) 10Rush: [C: 031] Spreadcheck should return 0 when everything is good. [puppet] - 10https://gerrit.wikimedia.org/r/287637 (owner: 10Andrew Bogott) [16:43:57] _joe_: nice! thnx! [16:44:46] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2281549 (10Dzahn) Alright, then we should create a new group (naming suggestions? "scab-admins" may not be the best :), but for this purpose, and then put it on the hosts via hieradata/role/... [16:44:53] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281550 (10Krenair) I can't log in to phab-01.phabricator.eqiad.wmflabs (10.68.16.201), even as root. Maybe someone with access to the labs salt master can get in. [16:45:41] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281552 (10fgiunchedi) I did `graphite-labs.graphite.eqiad.wmflabs` and `graphite1.graphite.eqiad.wmflabs` [16:47:55] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281561 (10Dzahn) integration-raita can be disregarded. that was fixed by hashar. i think it just needs a little more time to disappear from the UI but there is no new data [16:48:59] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281578 (10Krenair) >>! In T115330#2281529, @Dzahn wrote: > | 10.64.48.132 | 3(NXDOMAIN) `templates/wmnet:822:wmf4727-test 1H IN A 10.64.48.132` - resolving that h... [16:50:47] (03PS1) 10Gehel: WIP - Preparing configuration for new maps servers [puppet] - 10https://gerrit.wikimedia.org/r/287992 (https://phabricator.wikimedia.org/T134901) [16:51:01] (03PS8) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [16:51:20] (03CR) 10Nuria: "Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/285309 (owner: 10Dzahn) [16:52:03] (03CR) 10jenkins-bot: [V: 04-1] change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [16:53:52] (03PS9) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [16:54:32] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281594 (10Dzahn) wlmjurytool2014.wlmjurytool.eqiad.wmflabs. - killed gmond, there is puppet fail about starting ganglia-monitor and i think it's self-hosted master. but gm... [16:59:00] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259756 (10JanZerebecki) Why would deployment of one service force you to change an uninvolved service? [17:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160510T1700). Please do the needful. 
[17:01:24] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:02:06] (03PS10) 10Rush: change up alerting for services within tools in icinga [puppet] - 10https://gerrit.wikimedia.org/r/287723 [17:04:02] (03CR) 10Rush: [C: 032] "yuvi set sail on a plane just now but we talked about the note, fixed here as described. I believe I have equiv of +1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/287723 (owner: 10Rush) [17:04:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150046 keys - replication_delay is 0 [17:05:39] (03PS2) 10Andrew Bogott: Spreadcheck should return 0 when everything is good. [puppet] - 10https://gerrit.wikimedia.org/r/287637 [17:14:05] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2281705 (10mobrovac) >>! In T134251#2281549, @Dzahn wrote: > Alright, then we should create a new group (naming suggestions? "scab-admins" may not be the best :), but for this purpose, and t... [17:18:00] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2281733 (10RobH) 05Open>03Resolved So my attempt to install sgdisk and copy partitions worked, but then my command to randomize the GUID of the new disk (since it copied SDC) failed and randomized the GUIDs of ev... [17:18:02] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, and 2 others: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2281735 (10RobH) [17:18:14] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, and 2 others: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086341 (10RobH) The raid issue is resolved, but service implementation still needs to occur. [17:18:51] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2281737 (10Samtar) @BBlack completely understand, I'll try the above (Win 8.1 Pro + FF 46.0.1) and report back - saying that, I've spent the last couple of minutes trying to force it to happen (both before... [17:21:52] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2281758 (10Dzahn) >>! In T115330#2281542, @AlexMonk-WMF wrote: > Ran `dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia` I did the same on the last couple ins... [17:23:07] (03CR) 10Andrew Bogott: [C: 032] Spreadcheck should return 0 when everything is good. [puppet] - 10https://gerrit.wikimedia.org/r/287637 (owner: 10Andrew Bogott) [17:25:00] 06Operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2281774 (10RobH) [17:27:25] (03PS1) 10Rush: change icinga host for toollabs from ip to hostname based [puppet] - 10https://gerrit.wikimedia.org/r/287993 [17:28:57] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T134755#2281787 (10RobH) Understood, we'll want to wipe any servers and disks we reclaim for any purpose though, so we won't touch anything until a decision has been made on db archiving. F... [17:29:09] i'm plannning to deploy OCG in this window [17:29:11] fyi [17:30:54] (03CR) 10Rush: [C: 032 V: 032] "jenkins?" 
[puppet] - 10https://gerrit.wikimedia.org/r/287993 (owner: 10Rush) [17:31:15] (03PS2) 10BBlack: Pipe websockets through traffic layers [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) [17:31:17] (03PS2) 10BBlack: tlsproxy: minimize keepalives diff in config [puppet] - 10https://gerrit.wikimedia.org/r/287940 (https://phabricator.wikimedia.org/T134870) [17:31:19] (03PS1) 10BBlack: tlsproxy: switch to (non-persistent) HTTP/1.1 [puppet] - 10https://gerrit.wikimedia.org/r/287995 (https://phabricator.wikimedia.org/T134870) [17:31:21] (03PS1) 10BBlack: tlsproxy: turn proxy_request_buffering off [puppet] - 10https://gerrit.wikimedia.org/r/287996 [17:31:56] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:32:28] ^this is me I believe addressing it [17:32:37] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [17:37:02] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2281804 (10RobH) 05stalled>03Resolved It seems there are no tasks assigned to this as a blocker, and the order is handled via the #proc... [17:37:40] 06Operations, 10ops-ulsfo: power loss in ulsfo cabinet 1.23 - https://phabricator.wikimedia.org/T134330#2281807 (10RobH) 05Open>03Resolved [17:38:10] !log starting OCG deploy [17:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:38] !log updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3 (T120079) [17:41:39] T120079: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079 [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:25] (03PS2) 10Andrew Bogott: Enable base::firewall on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/286145 (owner: 10Muehlenhoff) [17:52:14] (03CR) 10Andrew Bogott: [C: 032] Enable base::firewall on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/286145 (owner: 10Muehlenhoff) [17:53:18] 06Operations, 10Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#2281895 (10Dzahn) [17:53:20] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2281892 (10Dzahn) 05Open>03Resolved a:03Dzahn Now it has an SPF record. ``` ;; QUESTION SECTION: ;phabricator.wikimedia.org. IN TXT ;; ANSWER SECTION: p... [17:53:27] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [17:53:36] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2281896 (10Dzahn) a:05Dzahn>03Mschon [17:53:45] cscott, done with depl? [17:53:54] i need to push graphoid out [17:54:16] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [17:55:12] yurik: yup, finished. [17:55:24] cscott, do you know if anyone else is puhsing stuff out? [17:55:58] yurik: note that i know of. parsoid and OCG are using the convention that we'll !log before we start a deploy, so you can check the SAL to see if anything else is active. 
[17:56:15] *not that i know of [17:56:26] cscott, thx for letting me know, i will use the same conv [17:56:30] !log about to deploy graphoid [17:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:08] (03PS1) 10Ottomata: Configure kafka1022 as a confluent 0.9 broker [puppet] - 10https://gerrit.wikimedia.org/r/288009 (https://phabricator.wikimedia.org/T121562) [17:57:25] yurik: also, we discovered recently that there's a bot which will add autolinks to the SAL to phab tasks if you mention the phab tasks in your !log mesage. So when you finish your deploy, you might try adding the Txxx numbers of whatever bugs this deploy is supposed to fix. [17:57:44] cscott, thx, good to know [17:57:45] (see my previous !log for instance) [17:57:53] !log stopping camus and puppet on analytics1027 during upgrade of one kafka broker [17:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:09] elukey: fyi i'm gonna do this one as we discussed, lemme know if you want to hang out during [17:58:47] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:50] !log OCG: clearing cache for ocg1003.eqiad.wmnet and ocg1003 (T120079) [18:00:51] T120079: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079 [18:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:58] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:01:15] !log stopping kafka on kafka1022 to upgrade to 0.9 [18:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:37] 06Operations, 10ops-codfw, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2281935 (10jcrespo) 05Open>03stalled es2019 has been reimaged from es2017. Let's wait now that moth to see if it fails again while I test GTID here. [18:01:38] !log OCG: script reported "Cleared 0 (of 363141 total) entries from cache in 56.894 seconds" (T120079) [18:01:39] T120079: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079 [18:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:12] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2281939 (10JanZerebecki) >>! In T134251#2281705, @mobrovac wrote: >>>! In T134251#2281625, @JanZerebecki wrote: >> Why would deployment of one service force you to change an uninvolved servi... [18:02:13] almost done with graphoid depl [18:03:24] (03CR) 10Ottomata: [C: 032] Configure kafka1022 as a confluent 0.9 broker [puppet] - 10https://gerrit.wikimedia.org/r/288009 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [18:03:27] 06Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2281956 (10cscott) [18:03:29] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2281957 (10cscott) [18:03:31] 06Operations, 10OCG-General, 06Scrum-of-Scrums, 06Services, 07Technical-Debt: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2281953 (10cscott) 05Open>03Resolved a:03cscott Ok, looks like the script is fixed (see above). Of course, the entries all ex... 
[18:04:33] gehel, could you help for a sec, i forgot what servers i need to reboot (graphoid service that is) [18:04:52] 06Operations, 10ArchCom-RfC, 06Services: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#2281963 (10cscott) [18:04:54] 06Operations, 10OCG-General, 06Scrum-of-Scrums, 06Services, 07Technical-Debt: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2281962 (10cscott) [18:05:34] yurik: not sure at all... I have mostly no idea what graphoid is... [18:06:09] yurik: let me do some grep magic... [18:06:28] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2281970 (10cscott) I believe this is unblocked now. [18:06:51] gehel, basically i already restarted scb1001.eqiad.wmnet, and will do the same for scb1002.eqiad.wmnet, but i think there are more [18:07:33] yurik: I don't see that server in site.pp... [18:08:06] _joe_: fyi, i deployed the fix for the "recursive nexttick" and successfully ran the script to empty the cache for ocg1003. ganglia indicates that it is quite now, with no appreciable cpu or network. [18:08:21] yurik: sorry pebcak: /^sca[12]00[12]\.(eqiad|codfw)\.wmnet$/ + /^scb[12]00[12]\.(eqiad|codfw)\.wmnet$/ [18:08:23] (03PS1) 10Rush: collapse icinga monitoring for tools [puppet] - 10https://gerrit.wikimedia.org/r/288011 [18:08:38] (03PS2) 10Rush: collapse icinga monitoring for tools [puppet] - 10https://gerrit.wikimedia.org/r/288011 [18:09:16] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [18:09:42] yurik, so it seems that we have the same servers in codfw [18:09:45] gehel, which of these actually exist? [18:09:54] but don't quote me on that... [18:10:00] the sync reports 6 servers, but i suspect its a mestake [18:10:24] (03PS1) 10Ottomata: Fix include and thresholds for icinga alerts for confluent kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/288012 (https://phabricator.wikimedia.org/T121562) [18:10:27] !log testing GTID replication on es2019 T133385 T130702 [18:10:28] T133385: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385 [18:10:28] T130702: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702 [18:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:38] (03CR) 10jenkins-bot: [V: 04-1] Fix include and thresholds for icinga alerts for confluent kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/288012 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [18:11:44] yurik: how urgent is that? My multitasking abilities are not all that great and you can guess where the other part of my brain is at the moment... [18:12:07] gehel, not critical, but i should probably do it soonish :) I will check myself, maybe i will find it [18:12:15] (03PS2) 10Ottomata: Fix include and thresholds for icinga alerts for confluent kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/288012 (https://phabricator.wikimedia.org/T121562) [18:13:13] yurik: ok, I'll slowly dig into that ... [18:13:22] (03CR) 10Rush: [C: 032] "trying to fix icinga!" 
[puppet] - 10https://gerrit.wikimedia.org/r/288011 (owner: 10Rush) [18:13:24] gehel, thx :) [18:14:33] (03PS3) 10Ottomata: Fix include and thresholds for icinga alerts for confluent kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/288012 (https://phabricator.wikimedia.org/T121562) [18:14:42] (03CR) 10Ottomata: [C: 032 V: 032] Fix include and thresholds for icinga alerts for confluent kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/288012 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [18:15:47] yurik: I only find those 4 servers, but I have no really sure what I'm looking for... [18:16:47] (03PS1) 10Ottomata: Fix kafka typo [puppet] - 10https://gerrit.wikimedia.org/r/288014 [18:17:23] (03CR) 10Ottomata: [C: 032 V: 032] Fix kafka typo [puppet] - 10https://gerrit.wikimedia.org/r/288014 (owner: 10Ottomata) [18:17:37] (03PS1) 10Andrew Bogott: Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 [18:18:31] (03PS1) 10Ottomata: Require proper class in confluent::kafka::broker::alerts [puppet] - 10https://gerrit.wikimedia.org/r/288016 [18:18:49] !log disabling puppet on caches for nginx change observe/deploy... [18:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:59] (03CR) 10Ottomata: [C: 032 V: 032] Require proper class in confluent::kafka::broker::alerts [puppet] - 10https://gerrit.wikimedia.org/r/288016 (owner: 10Ottomata) [18:19:21] gehel, ah, its all good, i found it in puppet/conftool-data/nodes/codfw.yaml & eqiad [18:19:32] yurik: as far as I can see, we only have 2 graphoid servers in each DC behind the LVS endpoint [18:19:45] gehel, yep, i concur [18:20:19] (03PS3) 10BBlack: tlsproxy: minimize keepalives diff in config [puppet] - 10https://gerrit.wikimedia.org/r/287940 (https://phabricator.wikimedia.org/T134870) [18:20:41] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: minimize keepalives diff in config [puppet] - 10https://gerrit.wikimedia.org/r/287940 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [18:20:56] !log finished graphoid deployment & restart. T134575 [18:20:57] T134575: Update graphoid service to Vega 2.5.2 - https://phabricator.wikimedia.org/T134575 [18:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:43] (03PS1) 10Ottomata: Fix ferm rule for confluent kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/288017 (https://phabricator.wikimedia.org/T121562) [18:23:08] (03PS2) 10Ottomata: Fix ferm rule for confluent kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/288017 (https://phabricator.wikimedia.org/T121562) [18:24:16] (03CR) 10Ottomata: [C: 032 V: 032] Fix ferm rule for confluent kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/288017 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [18:25:55] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 39 failures [18:27:05] (03PS5) 10Dzahn: interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 [18:27:17] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/2733/ ??" 
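For the record, the server hunt above comes down to a couple of greps in operations/puppet. The conftool-data paths are the ones yurik names; the other locations are just the usual places to look and may not be exhaustive.

    # pooled LVS backends for the service
    grep -n 'graphoid' conftool-data/nodes/eqiad.yaml conftool-data/nodes/codfw.yaml
    # role and host assignments
    grep -rn 'graphoid' manifests/site.pp hieradata/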
[puppet] - 10https://gerrit.wikimedia.org/r/284083 (owner: 10Dzahn) [18:28:17] (03CR) 10jenkins-bot: [V: 04-1] interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 (owner: 10Dzahn) [18:28:30] (03PS2) 10BBlack: tlsproxy: switch to (non-persistent) HTTP/1.1 [puppet] - 10https://gerrit.wikimedia.org/r/287995 (https://phabricator.wikimedia.org/T134870) [18:28:41] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: switch to (non-persistent) HTTP/1.1 [puppet] - 10https://gerrit.wikimedia.org/r/287995 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [18:29:54] (03PS2) 10Andrew Bogott: Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 [18:29:57] (03PS1) 10Ottomata: Hardcode ferm::service kafka broker port [puppet] - 10https://gerrit.wikimedia.org/r/288018 [18:30:26] (03CR) 10Ottomata: [C: 032 V: 032] Hardcode ferm::service kafka broker port [puppet] - 10https://gerrit.wikimedia.org/r/288018 (owner: 10Ottomata) [18:30:41] (03CR) 10Hashar: contint: create /mnt/redis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287946 (owner: 10Hashar) [18:31:13] (03CR) 10jenkins-bot: [V: 04-1] Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 (owner: 10Andrew Bogott) [18:31:44] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:35:17] (03PS3) 10Andrew Bogott: Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 [18:35:19] (03PS1) 10Andrew Bogott: s/kakfa/kafka [puppet] - 10https://gerrit.wikimedia.org/r/288019 [18:36:03] (03CR) 10Andrew Bogott: "Puppet compiler confirms this is a no-op on labcontrol1001" [puppet] - 10https://gerrit.wikimedia.org/r/288015 (owner: 10Andrew Bogott) [18:36:34] (03CR) 10jenkins-bot: [V: 04-1] s/kakfa/kafka [puppet] - 10https://gerrit.wikimedia.org/r/288019 (owner: 10Andrew Bogott) [18:37:04] (03CR) 10jenkins-bot: [V: 04-1] Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 (owner: 10Andrew Bogott) [18:37:18] Can I get a root to nuke /tmp/make-wmf-branch from tin? [18:37:19] hm, did someone just add 'kafka' to the typo checker? [18:37:45] ostriches: yea, dne [18:38:04] !log deleted /tmp/make-wmf-branch on tin by request [18:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:14] mutante: Thx [18:38:35] (03PS4) 10Andrew Bogott: Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 [18:38:35] andrewbogott: didn't I add kakfa? [18:38:37] (03PS2) 10Andrew Bogott: s/kakfa/kafka [puppet] - 10https://gerrit.wikimedia.org/r/288019 [18:38:40] haha did I typo while adding a typo?? [18:38:42] haha [18:38:57] andrewbogott: i added kakfa [18:39:01] that isa tyope, no? [18:39:10] AHHH [18:39:12] it found typos :/ [18:39:13] hahah [18:39:14] ottomata: your change to the typos file was correct [18:39:15] thanks [18:39:25] but you should generally remove typos before… otherwise no one can merge ever again [18:39:33] yeah sorry [18:39:34] (and, actually, didn't jenkins flag your change?) [18:39:42] heh == Function: kakfa_config(string cluster_prefix[, string site]) [18:40:09] it might not have, that change came along with another that I was hasty to merge, since I needed during an upgrade in progress [18:40:15] sorry about that [18:40:33] looks like jenkins approved it. 
Weird, it most only apply on n+1 [18:40:34] (03CR) 10Ottomata: [C: 031] s/kakfa/kafka [puppet] - 10https://gerrit.wikimedia.org/r/288019 (owner: 10Andrew Bogott) [18:40:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 714 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6161122 keys - replication_delay is 714 [18:40:55] (03CR) 10Andrew Bogott: [C: 032] s/kakfa/kafka [puppet] - 10https://gerrit.wikimedia.org/r/288019 (owner: 10Andrew Bogott) [18:41:09] (03CR) 10Andrew Bogott: [C: 032] Move several Labs IPs and IP ranges into Hiera. [puppet] - 10https://gerrit.wikimedia.org/r/288015 (owner: 10Andrew Bogott) [18:41:11] (03PS1) 10Yurik: Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 [18:42:01] gehel, ^ [18:42:57] (03CR) 10jenkins-bot: [V: 04-1] Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 (owner: 10Yurik) [18:44:47] (03CR) 10Dzahn: [C: 04-1] interface: move aggregate_member to own file [puppet] - 10https://gerrit.wikimedia.org/r/284084 (owner: 10Dzahn) [18:45:08] (03PS2) 10Andrew Bogott: Removed the transitional labs-ns2 and labs-ns3 definitions. [dns] - 10https://gerrit.wikimedia.org/r/287245 (https://phabricator.wikimedia.org/T126758) [18:45:28] grr, how should the arrows be aligned in https://gerrit.wikimedia.org/r/#/c/288020/1/modules/graphoid/manifests/init.pp ???? [18:46:00] it complains about line 40,41,42 [18:46:06] (03CR) 10Andrew Bogott: [C: 032] Removed the transitional labs-ns2 and labs-ns3 definitions. [dns] - 10https://gerrit.wikimedia.org/r/287245 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [18:48:16] i want to get on this machine: integration-slave-trusty-1004.integration.eqiad.wmflabs but i can't, even as root [18:48:22] but i know it's running [18:48:32] and it's somehow different from the others [18:48:45] yurik: everything is one space too many to the right [18:49:05] yurik: does that make sense? Everything should be aligned, but also as far left as possible [18:49:24] andrewbogott, you mean i need to remove one space in front of the arrow? [18:49:24] bleh [18:49:47] yurik: one space between the word and => [18:49:55] allowedDomains => [18:50:00] bleh. Thansk! fixed [18:50:08] (03PS2) 10Yurik: Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 [18:50:18] OCD galore [18:52:02] andrewbogott: when i cant get on labs instances as root.. should i try running stuff via salt? [18:52:05] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:52:07] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282229 (10hashar) [18:52:16] i have like one or 2 .where it's the case [18:52:29] including phab-01 [18:52:32] mutante: it's worth a try, although if your root key doesn't work that's a bad sign [18:52:49] i just want to rm -rf something on them and then i'm fine :) [18:52:54] eh, kill gmond [18:53:16] andrewbogott: from labcontrol? [18:53:22] yeah [18:53:26] ok, thx [18:53:27] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10hashar) I have moved @Dzahn list of IP/FQDN to the task detail ( https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-4wg7fhgbo3bwvli/ ) this wa... 
[18:53:49] mutante: I have updated https://phabricator.wikimedia.org/T115330 to list the labs instances emitting to Ganglia in the task detail [18:53:53] so we can mark them as OK :-} [18:54:28] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282254 (10hashar) [18:55:04] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10hashar) [18:55:20] hashar: great! we have already fixed like half i think:) [18:55:27] probably [18:56:14] I have a few patches for puppet modules apache, hhvm and OCG that drops ganglia whenever standard::has_ganglia == false. But I am not sure whom to ask review [18:56:23] listed at https://phabricator.wikimedia.org/T134808#2281327 [18:57:16] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282265 (10Dzahn) [18:57:26] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6151586 keys - replication_delay is 0 [18:58:27] hashar: (OCG does have ganglia support, but I'm guessing that I'm misunderstanding what your patch does) [18:58:28] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) [18:59:09] cscott: hello! the ocg puppet class is applied on the beta cluster instance and thus the labs instance emit ganglia metrics to the production ganglia [18:59:17] cscott: nothing to worry about for prod ;-} [18:59:31] (03PS3) 10Yurik: Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 [18:59:48] ah [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160510T1900). [19:00:20] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2282270 (10Dzahn) [19:01:51] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) Maybe somebody in Analytics could take care of limn and maintenance.analytics? [19:05:18] (03PS1) 10Ottomata: Fix for kafka broker process alert for confluent brokers [puppet] - 10https://gerrit.wikimedia.org/r/288027 (https://phabricator.wikimedia.org/T121562) [19:06:11] hashar: are the 3 patches really dependennt on each other? no right. maybe you could remove that [19:06:19] i'm looking already [19:06:33] yeah they are independent [19:06:37] I have chained them out of pure laziness [19:06:44] merging the one for ocg.. i'd do that [19:06:48] and to indicate they are doing the same [19:06:58] and apache also looking.. maybe hhvm not so much [19:07:00] yup cherry pick button to refs/heads/production does magic;) [19:07:13] oh [19:07:17] ok [19:07:17] your call!! 
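A rough sketch of the salt fallback discussed above for labs instances that refuse SSH even as root. The minion glob and the exact cleanup command are assumptions modelled on the dpkg/rm commands used earlier in this task, so both would need adjusting before running anything.

    # on the labs salt master (labcontrol)
    salt 'phab-01*' test.ping      # confirm the minion still answers at all
    salt 'phab-01*' cmd.run 'pkill gmond; dpkg --purge ganglia-monitor libganglia1; rm -rf /etc/ganglia'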
I dont feel obligated :} [19:07:59] (03CR) 10Ottomata: [C: 032] Fix for kafka broker process alert for confluent brokers [puppet] - 10https://gerrit.wikimedia.org/r/288027 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [19:09:05] (03PS2) 10Dzahn: ocg: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287976 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:10:56] (03PS3) 10Dzahn: ocg: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287976 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:11:08] (03CR) 10Dzahn: [C: 032] "no-op in production http://puppet-compiler.wmflabs.org/2736/" [puppet] - 10https://gerrit.wikimedia.org/r/287976 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:11:17] at least all of beta and CI instances should be out of ganglia [19:11:27] !log reenabling camus and puppet on analytics1027 [19:11:29] * hashar checks for leftover gmond process [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:18] cscott: i confirmed there was no change on ocg1001 or anything. "labs only" [19:13:29] \O/ [19:17:34] (03PS2) 10Dzahn: apache: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287695 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:18:59] (03PS2) 10Andrew Bogott: horizon: Enable password autocomplete on login form [puppet] - 10https://gerrit.wikimedia.org/r/286112 (owner: 10BryanDavis) [19:19:34] (03CR) 10Dzahn: [C: 032] "no-op in production on random appservers http://puppet-compiler.wmflabs.org/2737/" [puppet] - 10https://gerrit.wikimedia.org/r/287695 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:19:42] (03CR) 10Tjones: [C: 031] A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [19:20:21] !log upgraded kafka1022 to confluent kafka 0.9.0.1 [19:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:46] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 675 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6153217 keys - replication_delay is 675 [19:20:59] (03CR) 10Andrew Bogott: [C: 032] horizon: Enable password autocomplete on login form [puppet] - 10https://gerrit.wikimedia.org/r/286112 (owner: 10BryanDavis) [19:21:27] (03PS3) 10Andrew Bogott: horizon: Enable password autocomplete on login form [puppet] - 10https://gerrit.wikimedia.org/r/286112 (owner: 10BryanDavis) [19:22:56] andrewbogott, do you know how to make puppet param optional? I have a config block that might have a few optional params, and i don't want them to be set unless they are defined [19:23:35] yurik: is there a reason why giving them a default won't work? [19:24:08] (03PS1) 10Chad: Group0 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288031 [19:24:14] (03CR) 10Dzahn: "double checked on mw2001 (because it has ganglia_aggregator = true), and on others that don't.. no-op" [puppet] - 10https://gerrit.wikimedia.org/r/287695 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:25:04] (03CR) 10Chad: [C: 032] Group0 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288031 (owner: 10Chad) [19:25:47] andrewbogott, actually i just realized a "false" value will work too! 
nvm :) [19:25:53] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288031 (owner: 10Chad) [19:25:54] ok :) [19:26:05] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2282393 (10Aklapper) [19:26:36] !log demon@tin Started scap: group0 wikis to 1.28.0-wmf.1 [19:26:42] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2282404 (10Bawolff) As an aside, I have a patch which would make a change in the format of cl_collation (https://gerrit.wikimedia.org/r/#/c/272419/ ), which would require running u... [19:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:10] hashar: Started scap for group0. Everything should just be about done after that except deploy notes (already did all the other cleanup from prev. weeks) [19:27:24] oh [19:27:25] deploy notes [19:27:35] I have never ever generated them [19:27:35] :( [19:27:41] andre__: nice typo catch :) [19:27:51] hashar: Ummm, it's one of the steps! [19:28:07] Although it seems to be breaking, it wants the old branch to be wmf.22, not wmf.23 [19:28:10] guess my brain skip it entirely [19:28:37] 06Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2282421 (10Dzahn) [19:28:39] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2282419 (10Dzahn) 05stalled>03Open @cscott cool, so currently ocg1003 is depooled already? [19:29:05] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes [19:29:40] (03PS1) 10Yurik: Set graphoid result headers [puppet] - 10https://gerrit.wikimedia.org/r/288033 (https://phabricator.wikimedia.org/T134542) [19:30:10] looks like I stop at Purge localization cache for now unused versions [19:30:13] :( [19:30:47] Heh, those steps are out of order actually w.r.t. purging old versions [19:32:29] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2282430 (10Multichill) https://www.wikidata.org/w/index.php?title=Q889&curid=1211&diff=334725206&oldid=333223017 looks promising, but I don't se... [19:37:26] andrewbogott: i just tried and let puppet-compiler auto-pick some hosts on a change. in this example, change only touches modules/hhvm, compiler picked some random hosts to start with. like lvs2005, ms-be1001, mw1146, mw1186, wtp1006. that's it, those 5, then it finishes relatively quick and tells me there are no changes. so it doesnt break, and it doesnt take hours.. but most of these hosts dont have hhvm and why they were picked is a myst [19:37:42] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2282443 (10cscott) @Dzahn ocg1003 is in 'decommission mode', where it would respond to front-end requests for cached files, but won't start any new backend jobs. I've also confirmed that the cache... [19:38:34] mutante: did it also test some that did have modules/hhvm? Or did it miss the point entirely? 
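yurik's optional-parameter question above usually resolves the way andrewbogott suggests: give the parameter a default of false or undef and only emit the setting when the caller actually set a value. A hedged sketch with made-up class and parameter names:

```puppet
# Hedged sketch only: default the optional parameter to false and act on it
# conditionally, so omitted parameters never appear in the rendered config.
class graphoid_example (
    $domains    = [],
    $header_ttl = false    # optional: callers may simply omit it
) {
    if $header_ttl {
        $extra_config = { 'cache_ttl' => $header_ttl }
    } else {
        $extra_config = {}    # key is left out of the rendered config entirely
    }
    # $extra_config would then be merged into the service's config hash
}
```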
[19:40:59] andrewbogott: the 2 mw servers do have it, but i am not sure if that was just being lucky, because there are many mw hosts and statistical chance [19:46:17] (03PS2) 10Dzahn: hhvm: skip ganglia when it is unwanted [puppet] - 10https://gerrit.wikimedia.org/r/287743 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:46:42] (03CR) 10Dzahn: [C: 032] "also no change in prod , f.e. http://puppet-compiler.wmflabs.org/2738/mw1186.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/287743 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:49:01] 07Blocked-on-Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2282461 (10Dzahn) @Hashar all 3 merged (after chec... [19:49:26] oh puppet compiler! [19:49:30] i should have thought about it [19:50:20] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2282465 (10hashar) [19:50:23] 07Blocked-on-Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Puppet fails on labs instances due to Ganglia (ex: using apache::site puppet class) - https://phabricator.wikimedia.org/T134808#2282463 (10hashar) 05Open>03Resolved puppet co... [19:50:55] mutante: thanks a lot ! [19:51:16] hashar: :) [19:51:20] (03CR) 10Dzahn: "yep, also double confirmed on mw1186 after merge, nothing happened here" [puppet] - 10https://gerrit.wikimedia.org/r/287743 (https://phabricator.wikimedia.org/T134808) (owner: 10Hashar) [19:52:23] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2282469 (10BBlack) I should've noted above: if you apply the manual HTTP/2 disable, please don't forget to turn it back on later after sufficient testing! [19:53:05] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:25] !log demon@tin Finished scap: group0 wikis to 1.28.0-wmf.1 (duration: 26m 49s) [19:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:03] andrewbogott: what the heck.. it is running again and now doing it on many more hosts just like the other day [19:54:11] like that change was already done [19:55:05] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.050 second response time [19:55:57] ... and it did find an actual change that i would not have thought about [19:56:00] haha [20:00:34] (03PS2) 10Ladsgroup: wikilabels: enable CORS [puppet] - 10https://gerrit.wikimedia.org/r/287570 [20:01:21] ostriches: there is a bunch of weird errors [20:01:27] eg Notice: Unable to unserialize: [-1]. Expected ':' but got '1'. in /srv/mediawiki/php-1.28.0-wmf.1/includes/objectcache/RedisBagOStuff.php on line 313 [20:01:38] Hrm... [20:01:53] !log re-enabling puppet on caches [20:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:05] filling it as a task [20:02:38] That's odd af. [20:02:42] (03CR) 10Ladsgroup: "Added this line manually to the ini file in labels-experiment." [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [20:03:23] fatal: ambiguous argument 'wmf/1.27.0-wmf.22..wmf/1.28.0-wmf.1': unknown revision or path not in the working tree. 
[20:03:31] Why would you pick wmf.22 as the old branch? [20:03:34] Stupid script [20:04:23] (03PS1) 10Dzahn: install_server/ocg: let ocg1003 use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/288049 (https://phabricator.wikimedia.org/T84723) [20:05:18] lolol [20:05:25] // check, if there was a good result, otherwise assume, that there are 22 previous minor versions and use that [20:05:27] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2282530 (10BBlack) I've merged the first two patches, which are really pre-patches from this ticket's POV. There's some interaction between this work and T107749 , so I'll put the detail... [20:05:29] hashar: ^^ [20:05:45] hardcoded convention! [20:06:14] (03PS2) 10Dzahn: install_server/ocg: let ocg1003 use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/288049 (https://phabricator.wikimedia.org/T84723) [20:06:52] (03CR) 10Dzahn: [C: 032] install_server/ocg: let ocg1003 use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/288049 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [20:07:19] Erp, or not.... [20:07:20] Hmm [20:07:56] 06Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2140241 (10Dzahn) we need to create a new partman recipe and let ocg1003 use it first [20:11:13] ostriches: looks good to me overall [20:11:33] It's just these dumb deploy notes. [20:11:35] the redisbagostuff notice is concerning but I have really have no clue what the heck is happening there [20:12:58] (03PS1) 10Dzahn: install_server/ocg: let ocg1003 use raid1-lvm partman [puppet] - 10https://gerrit.wikimedia.org/r/288053 (https://phabricator.wikimedia.org/T130591) [20:13:07] (03CR) 10Dzahn: [V: 032] install_server/ocg: let ocg1003 use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/288049 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [20:13:47] Wait WHAT? [20:13:55] How in holy hell.... [20:15:04] (03PS2) 10Dzahn: install_server/ocg: let ocg1003 use raid1-lvm partman [puppet] - 10https://gerrit.wikimedia.org/r/288053 (https://phabricator.wikimedia.org/T130591) [20:15:55] * ostriches sighs [20:16:23] (03PS3) 10Dzahn: install_server/ocg: let ocg1003 use raid1-lvm partman [puppet] - 10https://gerrit.wikimedia.org/r/288053 (https://phabricator.wikimedia.org/T130591) [20:16:59] (03CR) 10Dzahn: [C: 032] install_server/ocg: let ocg1003 use raid1-lvm partman [puppet] - 10https://gerrit.wikimedia.org/r/288053 (https://phabricator.wikimedia.org/T130591) (owner: 10Dzahn) [20:18:51] (03CR) 10Dzahn: [V: 032] "self-verify - it's not puppet code, just a cfg" [puppet] - 10https://gerrit.wikimedia.org/r/288053 (https://phabricator.wikimedia.org/T130591) (owner: 10Dzahn) [20:18:58] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2282604 (10thcipriani) >>! In T133992#2279073, @Dzahn wrote: > @thcipriani could you specify who is "CI" and "the releng" team i... [20:19:07] what's up ostriches? [20:19:27] I forgot how to use git! [20:19:41] Totally a peasant move. [20:21:04] GAHHHH [20:21:13] I hate release notes. 
[20:22:33] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2282615 (10Krenair) what would be the rationale for only providing a subset of contint-admins access, and why specifically that... [20:24:35] !log scheduled icinga downtime for ocg1003 and all services on it, rebooting to PXE (T84723) [20:24:36] T84723: reinstall OCG servers - https://phabricator.wikimedia.org/T84723 [20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:38] mutante: cscott are you switching OCG to Jessie? [20:27:20] the app armor profile would need some kind of update. The imageMagick path is different on Jessie [20:27:21] hashar: let's see, we are trying with one server and the new thing is that we can decom it [20:27:28] modules/ocg/templates/usr.bin.nodejs.apparmor.erb:74: /etc/ImageMagick/** r, [20:27:34] hashar: i think repartitioning was the first item on the agenda. [20:27:42] for now i am concerned with partman working and stuff [20:28:13] (03CR) 10Gehel: "Puppet compiler indicates that the list of domains also changes. Looking at this change, I don't see why, but that look suspicious. https:" [puppet] - 10https://gerrit.wikimedia.org/r/288033 (https://phabricator.wikimedia.org/T134542) (owner: 10Yurik) [20:28:19] 06Operations, 10OCG-General: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2282640 (10hashar) [20:28:44] cscott: mutante for imagemagick that is https://phabricator.wikimedia.org/T134773 and I have added OCG to it [20:29:01] hashar: cool! good to know [20:29:38] cscott: i have an installer :) [20:29:57] that may sound not like much but it's not always like that right away, heh [20:30:18] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282655 (10BBlack) I've been reviewing and re-testing a bunch of related things today. There are several inter-mixed issues and I'm not even going to try to separate the... [20:30:38] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2282661 (10hashar) We can revisit the list of contint-admins. I am not sure whether bd808, cscott, reedy or marktraceur still n... [20:31:56] cscott: RAID1-LVM with like 50GB on / , how does that sound [20:32:08] it passed the partman step, yay [20:32:23] (03CR) 10Gehel: [C: 031] "My bad, I did not see the dependent patch." [puppet] - 10https://gerrit.wikimedia.org/r/288033 (https://phabricator.wikimedia.org/T134542) (owner: 10Yurik) [20:33:40] ostriches: I am going off [20:33:43] err [20:33:44] to sleep [20:34:06] hashar: Adios [20:34:07] good night hashar [20:34:17] thx! [20:38:16] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2282696 (10BBlack) In the patch you mention separate cache-control headers for 'error' responses. What kinds of error responses? Are these 500s? 
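On the apparmor point hashar raises above: trusty keeps ImageMagick's configuration under /etc/ImageMagick while jessie is assumed here to use /etc/ImageMagick-6, so one option is to feed the directory into the template as a variable rather than hard-coding it on line 74. A sketch under those assumptions, not the actual ocg module:

```puppet
# Sketch only: pick the ImageMagick config directory per distro (the jessie
# path is an assumption based on Debian's packaging) and let the apparmor
# template interpolate it instead of hard-coding /etc/ImageMagick.
$imagemagick_confdir = $::lsbdistcodename ? {
    'jessie' => '/etc/ImageMagick-6',
    default  => '/etc/ImageMagick',
}

file { '/etc/apparmor.d/usr.bin.nodejs':
    ensure  => present,
    # the template would emit "<%= @imagemagick_confdir %>/** r," in place of
    # the literal path; reloading apparmor afterwards is left out of this sketch
    content => template('ocg/usr.bin.nodejs.apparmor.erb'),
}
```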
[20:40:29] RECOVERY - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is OK: TCP OK - 0.034 second response time on port 9042 [20:42:21] Ahh, that's it [20:42:29] * ostriches throws rocks at this stupid script [20:42:51] mutante: so, 'slow but useful' :) [20:43:00] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 3 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2282715 (10Krenair) >>! In T134017#2280141, @Nikki wrote: > I've also noticed that there are a lot of pages not appearing in the categories they... [20:44:08] (03PS1) 10Madhuvishy: [WIP] jupyterhub: Add module to set up Jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/288086 [20:46:39] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6151318 keys - replication_delay is 0 [20:50:26] andrewbogott: after i said that i tried to reproduce what it said by compiling it just on snapshot hosts, and then it was "no change" again. ¯\_(ツ)_/¯ [20:51:00] keeps manually selecting the nodes [20:52:14] cscott: first attempt i said "yay" too early.. ends up in BusyBox shell, cant find the new volumne group :/ [20:52:51] ..and that is why i said "may not sound like much" [20:54:29] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail [20:54:47] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2251644 (10bd808) I'd be glad to give up contint-admins membership. I haven't been active in supporting CI for quite a while. [20:55:54] (03CR) 10Gehel: "Puppet compiler is failing for new maps servers, but I assume it is because they are new... https://puppet-compiler.wmflabs.org/2741/" [puppet] - 10https://gerrit.wikimedia.org/r/287992 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [20:56:47] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282789 (10BBlack) And of course, there are still elevated random 503's on text, like before. Need to confirm if it's unrelated and coincidental (unlikely), or which of... [21:05:12] 06Operations, 10netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2282802 (10faidon) [21:07:04] !log merging and applying configuration for new maps servers [21:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:54] * gehel is deploying new servers for the first time. Please cross your fingers... [21:08:17] *crosses fingers* [21:08:28] (03PS2) 10Gehel: Preparing configuration for new maps servers [puppet] - 10https://gerrit.wikimedia.org/r/287992 (https://phabricator.wikimedia.org/T134901) [21:08:33] * MaxSem grabs popcorn [21:08:49] (03PS1) 10BBlack: Revert "tlsproxy: switch to (non-persistent) HTTP/1.1" [puppet] - 10https://gerrit.wikimedia.org/r/288089 (https://phabricator.wikimedia.org/T107749) [21:09:08] MaxSem: I hope you also have a fire extinguisher... 
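For context on the revert above: persistent (keepalive) connections from nginx to the local varnish generally require proxy_http_version 1.1 together with an empty Connection request header; the patch being reverted was the intermediate non-persistent HTTP/1.1 step. A rough Puppet-side sketch of toggling the directives, with hypothetical names and not the real tlsproxy module:

```puppet
# Rough illustration only: a flag keeps the behaviour easy to revert, as was
# just done above while the 503s from T107749 are investigated.
class tlsproxy_example ( $persistent_backend = false ) {
    if $persistent_backend {
        # HTTP/1.1 plus a cleared Connection header enables upstream keepalive
        $proxy_directives = "proxy_http_version 1.1;\nproxy_set_header Connection '';\n"
    } else {
        # back to the default non-persistent behaviour
        $proxy_directives = "proxy_http_version 1.0;\n"
    }
    # $proxy_directives would be interpolated into the nginx site template
}
```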
[21:09:34] as long as they're not serving traffic, don't fear [21:12:35] (03CR) 10Gehel: [C: 032] Preparing configuration for new maps servers [puppet] - 10https://gerrit.wikimedia.org/r/287992 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [21:12:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [21:13:17] (03PS2) 10BBlack: Revert "tlsproxy: switch to (non-persistent) HTTP/1.1" [puppet] - 10https://gerrit.wikimedia.org/r/288089 (https://phabricator.wikimedia.org/T107749) [21:13:25] (03CR) 10BBlack: [C: 032 V: 032] Revert "tlsproxy: switch to (non-persistent) HTTP/1.1" [puppet] - 10https://gerrit.wikimedia.org/r/288089 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [21:16:58] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282869 (10BBlack) FWIW, the random 503s look like this on GET of plain article URLs (and other things, of course): ``` 421 VCL_return c hash 421 VCL_call c mis... [21:19:05] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 5 failures [21:19:40] (03PS1) 10Ottomata: [WIP] Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:20:16] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 7 failures [21:21:08] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2282886 (10BBlack) Reverting just the HTTP/1.1 nginx patch makes the 503s go away (still using upstream module).... needs more digging in more-isolated testing.... [21:21:35] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:21:36] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:28:03] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2282889 (10Dzahn) [21:28:24] 06Operations, 10netops: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282892 (10faidon) [21:29:05] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 5 failures [21:30:37] (03PS4) 10Gehel: Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 (owner: 10Yurik) [21:30:57] 06Operations, 10netops: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282907 (10faidon) This is Juniper case [[ https://casemanager.juniper.net/casemanager/#/cmdetails/2016-0510-0764 | 2016-0510-0764 ]] now. [21:31:20] puppet errors on maps200[14] are my doing. Please ignore... [21:31:24] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2282908 (10Dzahn) I got a jessie installer and it finished, but then it doesn't detect the disk/controller and i am prompted with BusyBox. 14:01 like you had on another server recently 1... 
[21:33:10] (03CR) 10Gehel: [C: 032] Remove obsolete graphoid settings [puppet] - 10https://gerrit.wikimedia.org/r/288020 (owner: 10Yurik) [21:34:02] (03PS1) 10Aude: Set interwiki sorting order for West Frisian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288101 (https://phabricator.wikimedia.org/T103207) [21:34:24] (03PS2) 10Gehel: Set graphoid result headers [puppet] - 10https://gerrit.wikimedia.org/r/288033 (https://phabricator.wikimedia.org/T134542) (owner: 10Yurik) [21:39:36] (03CR) 10Gehel: [C: 032] Set graphoid result headers [puppet] - 10https://gerrit.wikimedia.org/r/288033 (https://phabricator.wikimedia.org/T134542) (owner: 10Yurik) [21:40:48] ocg1003 has been spam-failing in pybal lately: [21:40:48] May 10 21:37:05 lvs1002 pybal[1759]: [ocg_8000 IdleConnection] WARN: ocg1003.eqiad.wmnet (disabled/down/not pooled): Connection failed. [21:40:51] May 10 21:37:06 lvs1002 pybal[1759]: [ocg_8000 ProxyFetch] WARN: ocg1003.eqiad.wmnet (disabled/down/not pooled): Fetch failed, 0.002 s [21:40:54] etc... [21:40:57] known? [21:41:04] yes [21:41:11] well, not the spam [21:41:18] but that the install failed [21:41:19] oh, it's disabled too. I forget pybal also spams disabled backends [21:41:27] it should be disabled , yes [21:41:37] ok [21:42:15] basically the installer finished but then i stil end up in BusyBox and cant find volume group [21:42:39] but papaul got it to boot [21:51:35] 06Operations, 10netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2282958 (10faidon) p:05High>03Normal Link up since 21:12:38Z. Waiting to hear from Zayo about the root cause and if it was on their side. [22:02:58] (03CR) 10Yuvipanda: "Are you fully aware of the implications of this? ORES doesn't have any issues with it since it is a purely readonly service, but wikilabel" [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [22:03:43] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2283001 (10Nikki) >>! In T134017#2282715, @Krenair wrote: > {T117332}? Possibly, although it seems like I'm seeing two things because purging t... [22:04:29] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2283002 (10aude) @multichill sounds like a caching issue. Does it work for you in the interface with debug=true? 
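What "wikilabels: enable CORS" typically amounts to is a single Access-Control-Allow-Origin response header in the service's uwsgi/web configuration, which is also why yuvipanda's caution above matters for a writable service. The define, key and origin below are assumptions for illustration, not the content of the actual patch:

```puppet
# Hypothetical illustration only: the define, ini key and origin are made up.
# A read-only service like ORES can safely allow any origin; a writable one
# like wikilabels should consider restricting this value.
wikilabels::uwsgi_setting { 'cors_header':    # made-up define
    setting => 'add-header',
    value   => 'Access-Control-Allow-Origin: *',
}
```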
[22:08:17] !log ocg1003 - revoked old puppet cert, signed new cert, re-adding after reinstall [22:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:29] !log ocg1003 - papaul fixed the install issue [22:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:50] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2283026 (10faidon) [22:10:09] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: puppet fail [22:10:37] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#930642 (10Papaul) The problem was fixed by inserting the follow line into GRUB acpi=off irqpoll [22:11:34] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2277377 (10faidon) @yuvipanda mentioned on the procurement ticket (linked above) that we should use Let's Encrypt. Let's Encrypt does not allow wildcards, so I'm guessing we'd have t... [22:11:46] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2283043 (10Nikki) @aude: It does show up for me if I add ?debug=true to the URL [22:13:01] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2283060 (10yuvipanda) We only need toolserver.org and www.toolserver.org I think. [22:13:51] (03PS1) 10Gehel: WIP - Allow host specific private configuration [puppet] - 10https://gerrit.wikimedia.org/r/288106 (https://phabricator.wikimedia.org/T134901) [22:27:38] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283170 (10Yurik) @BBlack, the graphoid service now sets 3600 maxage on success, and 300 maxage on failure: ```Cache-Control: public, s-maxage=3600,... [22:27:59] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2283173 (10Dzahn) thank you very much @papaul for fixing that I could continue with the install. Re-added to puppet, signed new cert, added new salt-key..etc Initial puppet run, user accounts have... [22:28:23] !log graphoid was restarted on all scb servers with the new caching configuration. T134542 [22:28:24] T134542: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542 [22:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:29:39] cscott: you should be able to login again on ocg1003, it's on jessie, some issues now but kind of expected. but we can figure them out one by one now [22:31:12] like hashar said, probably apparmor issues. we might not have even solved all of those on the old setup, there were still some mysterious failures on pages w/ images on them which might have been triggered by apparmor preventing various random imagemagick conversion modes [22:31:24] cscott: yes, and names of font packages [22:31:36] ah. that might be 'solved' already in the travis config. [22:31:37] i'll look at the latter [22:32:09] it's similar to what moritz did in the MW module [22:32:19] to fix all the font issues on jessie [22:32:40] compare the list of packages in the README with the one in .travis.yml. 
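Since Let's Encrypt issues no wildcard certificates, the toolserver.org case above only needs the two hostnames yuvipanda lists, covered as subject alternative names. A hypothetical sketch; the letsencrypt::cert define below is illustrative, not a confirmed WMF puppet API:

```puppet
# Hypothetical sketch only: 'letsencrypt::cert' and its 'subjects' parameter
# are illustrative. No wildcard is needed, just the two names from the task.
letsencrypt::cert { 'toolserver.org':
    subjects => ['toolserver.org', 'www.toolserver.org'],
}
```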
[22:32:51] ok [22:33:15] also check out the size of / [22:33:34] pretty sure the README is the list of packages I installed in my debian/unstable build env, so in theory it should be closer to jessie. [22:33:48] but there is no /srv as a separate partition [22:33:57] maybe that's ok [22:34:09] it uses LVM , can be extended [22:34:29] ok @ README [22:34:32] mutante: the only thing which could stand to be on a separate partition is the OCG cache dir, I forget where we had put that. [22:34:43] 06Operations, 13Patch-For-Review: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2283199 (10Dzahn) on ocg1003 after reinstall: /dev/md0 46G 4.9G 39G 12% / [22:34:53] and logs, in theory, to keep them from interfering with other stuff, but I think we successfully turned down all the logging some time ago. [22:35:16] ok, *nod* [22:35:33] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283201 (10BBlack) @Yurik: Well, we can talk about longer cache lifetimes later. For something new it's fine. But my earlier question still stands:... [22:36:30] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Puppet has 5 failures [22:36:35] ACKNOWLEDGEMENT - OCG health on ocg1003 is CRITICAL: CRITICAL: connection error: (Connection aborted., error(111, Connection refused)) daniel_zahn T84723 [22:36:35] ACKNOWLEDGEMENT - puppet last run on ocg1003 is CRITICAL: CRITICAL: Puppet has 5 failures daniel_zahn T84723 [22:37:29] cscott: this will be another thing: (/Stage[main]/Ocg/Service[ocg]) Provider upstart is not functional on this host [22:37:31] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:38:15] oh, we're moving to debian from ubuntu? [22:38:19] i never liked upstart ;) [22:38:32] yes, Debian and systemd [22:38:38] but... systemd. sigh. [22:38:43] but i did something like that for ircd and stuff too [22:38:51] so i have existing examples [22:45:12] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283231 (10Yurik) @bblack, those are actually 400 errors, e.g. [[ https://www.mediawiki.org/api/rest_v1/page/graph/png/Extension%3AGraph%2FDemo/21130... [22:49:18] (03CR) 10Jforrester: [C: 04-1] "Not until 17 May." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 (owner: 10Jforrester) [22:57:32] (03PS1) 10Dzahn: ocg: make it work on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288112 (https://phabricator.wikimedia.org/T84723) [23:00:04] RoanKattouw ostriches Krenair Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160510T2300). Please do the needful. [23:00:04] James_F MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:15] * James_F waves. [23:00:17] hi. [23:00:18] Hello. I can SWAT this evening. 
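The "Provider upstart is not functional" failure above is the usual symptom of a service resource that still expects an upstart job on a jessie host; the generic fix is to ship a systemd unit and let the service resource use it. A sketch in the spirit of change 288112, not its literal content, with an illustrative unit path and template name:

```puppet
# Generic sketch of moving a service from upstart to systemd on jessie;
# daemon-reload handling is omitted for brevity.
if $::lsbdistcodename == 'jessie' {
    file { '/lib/systemd/system/ocg.service':
        ensure  => present,
        content => template('ocg/ocg.service.erb'),    # illustrative template
        before  => Service['ocg'],
    }
}

service { 'ocg':
    ensure => running,
    enable => true,
}
```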
[23:01:40] (03PS3) 10Dereckson: Undeploy UploadWizard from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 (owner: 10Bartosz Dziewoński) [23:02:14] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 (owner: 10Bartosz Dziewoński) [23:02:48] (03Merged) 10jenkins-bot: Undeploy UploadWizard from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287944 (owner: 10Bartosz Dziewoński) [23:04:49] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Undeploy UploadWizard from test2wiki ([[Gerrit:287944]], 1/2) (duration: 00m 30s) [23:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:18] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Undeploy UploadWizard from test2wiki ([[Gerrit:287944]], 2/2) (duration: 00m 27s) [23:05:24] MatmaRex: you can test it ^ [23:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:25] (03PS2) 10Dzahn: ocg: make it work on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288112 (https://phabricator.wikimedia.org/T84723) [23:05:36] Dereckson: thanks, looks undeployed to me :) [23:05:40] Okay. [23:06:01] (03PS3) 10Dzahn: ocg: make it work on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288112 (https://phabricator.wikimedia.org/T84723) [23:08:03] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:08:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 657 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6165319 keys - replication_delay is 657 [23:09:11] MatmaRex: too late as I send it to Zuul, but try to avoid > 72 chars first line, it looks terrible on GitHub (it breaks at 72, then changes the font) [23:09:41] blergh [23:10:20] hmm, won't it need a rebase anyway? so i might rephrase at the same time too. :D [23:10:24] Dereckson: ^ [23:10:37] (03CR) 10Dereckson: [C: 032] "SWAT, take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:10:47] MatmaRex: yes, you're right [23:10:51] good idea [23:11:17] (03PS6) 10Bartosz Dziewoński: Configure cross-wiki uploads from test2wiki to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) [23:11:25] (03PS7) 10Bartosz Dziewoński: Configure cross-wiki uploads from test2wiki to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) [23:12:03] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:13:23] (03CR) 10Dereckson: [C: 032] "SWAT, take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:13:59] (03PS4) 10Dzahn: ocg: make it work on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288112 (https://phabricator.wikimedia.org/T84723) [23:14:06] Zuul love us this night. [23:14:40] So it's rebased, not in the Zuul queue. 
[23:14:51] And I removed my previous CR and the previous JenkinsBot V [23:15:17] all should have been okay [23:15:22] (03CR) 10Dereckson: [V: 032] "SWAT, take 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [23:15:37] it's gotta end up there eventually [23:15:47] oh huh, this time apparently worked? weird [23:15:58] wait, uh [23:16:08] I've manually merged it. [23:16:09] Dereckson: Two more coming for SWAT, sorry; a wmf.23 and a wmf.1 one in VE. :-( [23:16:11] no, the one that appeared on https://integration.wikimedia.org/zuul/ is patchset 5 [23:16:21] so it was just delayed, i think [23:16:24] oh [23:16:43] I had the queue opened with 285708 as filter, didn't see it [23:16:51] (now I was seeing the -> beta) [23:17:38] James_F: k [23:17:45] !log dereckson@tin Synchronized wmf-config/filebackend-production.php: Configure cross-wiki uploads from test2wiki to testwiki ([[Gerrit:285708]], 1/2) (duration: 00m 27s) [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:55] Checkout is taking an age. Fun times. [23:18:18] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Configure cross-wiki uploads from test2wiki to testwiki ([[Gerrit:285708]], 2/2) (duration: 00m 27s) [23:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:39] MatmaRex: here you are :) [23:18:45] (03CR) 10Ladsgroup: "Thanks Yuvi for your input. Using CORS was the plan for Wikilabels as far as I can see in codes. They are designed to use CORS. One other " [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [23:18:50] yay [23:19:13] i can see a copy of https://test.wikipedia.org/wiki/File:Large_blue_square.png at https://test2.wikipedia.org/wiki/File:Large_blue_square.png , so the foreign repo works [23:19:30] (i can also see files from Commons, so the fallback also works: https://test2.wikipedia.org/wiki/File:Trucks_Speed_blank_sign.svg) [23:19:30] k [23:20:01] i'll try uploading a thing or two, don't wait for me. :) [23:20:03] (03PS2) 10RobH: disabling ulsfo for onsite work on 2016-05-11 [dns] - 10https://gerrit.wikimedia.org/r/287985 [23:20:09] k [23:20:12] James_F: tell me when you have something in wmf/* so we can run Jenkins tests directly [23:20:25] Dereckson: It's being created now. [23:20:38] (03PS4) 10Dereckson: Enable VisualEditor by default in SET mode for logged-in users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 (owner: 10Jforrester) [23:21:07] There. [23:21:12] !log disabling ulsfo via dns for onsite work tomorrow per T134831 [23:21:13] T134831: ulsfo planned maintenance on 2016-05-11 - https://phabricator.wikimedia.org/T134831 [23:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:38] (03CR) 10RobH: [C: 032] disabling ulsfo for onsite work on 2016-05-11 [dns] - 10https://gerrit.wikimedia.org/r/287985 (owner: 10RobH) [23:22:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6148590 keys - replication_delay is 0 [23:22:27] now to sit back and watch its traffic plummet. [23:22:42] odd to type that with a sense of calm. [23:22:52] Now listed. [23:22:58] "don't struggle, it'll be over soon" [23:23:33] ulsfo will never be over, we'll somehow end up with something else, though its not likely to be another problem datacenter. 
[23:24:10] heh https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=LVS%20loadbalancers%20ulsfo&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1462837394&g=network_report&z=large [23:24:15] already can see it dropping [23:24:21] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2283386 (10mmodell) @dzahn: yes but it can run on the same hardware as the www service. [23:24:24] (03CR) 10Dzahn: [C: 032] "going ahead since i'm not changing the servers that are in production" [puppet] - 10https://gerrit.wikimedia.org/r/288112 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [23:24:36] nice to see systems respect dns ttl. [23:26:01] James_F: for ja.wikipedia, you didn't add a reference to any task [23:26:26] They are aware this is coming? [23:26:27] Dereckson: There isn't a specific one. [23:26:38] Dereckson: Very aware. Talking to them about this for two years. :-) Don't worry. [23:26:54] Fine. [23:27:18] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 (owner: 10Jforrester) [23:27:54] (03Merged) 10jenkins-bot: Enable VisualEditor by default in SET mode for logged-in users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285985 (owner: 10Jforrester) [23:28:10] (03PS1) 10Dzahn: ocg: follow-up fix for systemd change [puppet] - 10https://gerrit.wikimedia.org/r/288129 [23:28:29] (03PS2) 10Dzahn: ocg: follow-up fix for systemd change [puppet] - 10https://gerrit.wikimedia.org/r/288129 [23:28:49] (03CR) 10Dzahn: [C: 032] ocg: follow-up fix for systemd change [puppet] - 10https://gerrit.wikimedia.org/r/288129 (owner: 10Dzahn) [23:28:59] (03CR) 10Dzahn: [V: 032] ocg: follow-up fix for systemd change [puppet] - 10https://gerrit.wikimedia.org/r/288129 (owner: 10Dzahn) [23:29:13] Dereckson: (everything with the uploads seems to be fine, thanks!) [23:29:20] Thanks for testing MatmaRex. [23:29:33] https://test.wikipedia.org/wiki/Special:Contributions/Matma_Rex https://test2.wikipedia.org/wiki/Special:Contributions/Matma_Rex [23:30:36] !log dereckson@tin Synchronized dblists/visualeditor-default.dblist: Enable VisualEditor in single edit mode on ja.wiki ([[Gerrit:285985]], 1/2) (duration: 00m 25s) [23:31:03] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable VisualEditor in single edit mode on ja.wiki ([[Gerrit:285985]], 2/2) (duration: 00m 25s) [23:31:05] James_F: please test ^ [23:31:12] Yup, doing so. [23:31:27] (03PS2) 10Dereckson: Centralise feedback for the visual editor at the Hindi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287873 (https://phabricator.wikimedia.org/T134789) (owner: 10Jforrester) [23:32:05] Dereckson: Yup, LGTM. [23:32:45] k [23:32:48] Thank you. 
[23:32:55] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287873 (https://phabricator.wikimedia.org/T134789) (owner: 10Jforrester) [23:33:31] (03Merged) 10jenkins-bot: Centralise feedback for the visual editor at the Hindi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287873 (https://phabricator.wikimedia.org/T134789) (owner: 10Jforrester) [23:34:39] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Centralise feedback for the visual editor at the Hindi Wikipedia ([[Gerrit:287873]], T134789) (duration: 00m 25s) [23:34:40] T134789: Centralize feedback for the visual editor at the Hindi Wikipedia - https://phabricator.wikimedia.org/T134789 [23:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:00] James_F: here you are ^ [23:35:39] * James_F tests. [23:35:58] Yup, working. [23:36:34] Thank you. [23:36:52] k [23:43:13] (03PS1) 10Dzahn: ocg: install the right font packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) [23:44:08] (03PS2) 10Dzahn: ocg: install the right font packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) [23:44:26] (03CR) 10jenkins-bot: [V: 04-1] ocg: install the right font packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [23:46:56] * James_F waits. [23:47:16] Dereckson, are you done? [23:47:37] Not yet. [23:49:02] MaxSem: we were waiting two Zuul merges [23:49:10] !log dereckson@tin Synchronized php-1.27.0-wmf.23/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: Fix 'Uncaught TypeError: this.emit is not a function' (T134794) (duration: 00m 28s) [23:49:11] T134794: [Regression wmf.23] Cannot switch to Read tab from VE, getting error "this.emit is not a function" - https://phabricator.wikimedia.org/T134794 [23:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:49:27] James_F: here you're for .23 ^ [23:49:32] Ta. [23:51:04] (03PS3) 10Dzahn: ocg: install the right font packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) [23:51:41] Dereckson: Yup, working. [23:51:45] k [23:51:46] (Darn cache.) [23:52:22] MaxSem: you've a change to add to the SWAT? [23:52:27] !log dereckson@tin Synchronized php-1.28.0-wmf.1/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: Fix 'Uncaught TypeError: this.emit is not a function' (T134794) (duration: 00m 25s) [23:52:28] T134794: [Regression wmf.23] Cannot switch to Read tab from VE, getting error "this.emit is not a function" - https://phabricator.wikimedia.org/T134794 [23:52:29] security [23:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:37] 06Operations, 06Labs, 10Tool-Labs: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2283509 (10Dzahn) So an existing example to copy puppetized LE setup for a misc service is new RT on ununpentium. [23:56:27] James_F: ? [23:56:51] Waiting. [23:57:26] Strange, the cache invalidation was faster for wmf23 [23:57:31] Aha, yes, working. [23:57:45] It's based on whatever five minute clock the server I hit is on. [23:57:47] Or something. [23:58:02] Thanks for testing. [23:58:08] MaxSem: Tin is up to you. 
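Change 288132 above deals with the fact that several font packages were renamed between the old Ubuntu hosts and jessie; a sketch of the per-distro selection it implies, with the package names below being assumptions based on Debian's general ttf-* to fonts-* renames rather than the patch's actual list:

```puppet
# Sketch only, not the literal content of change 288132: select font package
# names per distribution so the OCG render hosts get the right fonts on jessie.
if $::lsbdistcodename == 'jessie' {
    $ocg_fonts = ['fonts-dejavu-core', 'fonts-liberation']
} else {
    $ocg_fonts = ['ttf-dejavu-core', 'ttf-liberation']
}

package { $ocg_fonts:
    ensure => present,
}
```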
[23:58:13] thanks [23:58:40] !log maxsem@tin Synchronized php-1.28.0-wmf.1/extensions/Kartographer/: Security patch (duration: 00m 26s) [23:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:52] !log maxsem@tin Synchronized php-1.27.0-wmf.23/extensions/Kartographer/: Security patch (duration: 00m 25s) [23:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master