[00:02:22] (PS1) Andrew Bogott: Split out the mysql config into a separate class. [puppet] - https://gerrit.wikimedia.org/r/188248
[00:02:38] !log rbf2001 - initial puppet run, adding users
[00:02:42] Logged the message, Master
[00:03:13] !log rbf2002 - error while setting up RAID during installer (rbf2001 did not have this? or did it?)
[00:03:17] Logged the message, Master
[00:06:29] (PS2) Andrew Bogott: Split out the mysql config into a separate class. [puppet] - https://gerrit.wikimedia.org/r/188248
[00:07:39] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010151 (Krenair) wikitech-static is a backup of wikitech in case the cluster breaks and wikitech becomes inaccessible, IIRC. Don't know about radium.
[00:08:48] operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1010152 (Dzahn) i added rbf2001 to puppet, signed a cert, started the initial run etc. regarding rbf2002, i PXE booted it because it was down and not already installed like rbf2001, but i...
[00:10:00] Phabricator, Project-Creators, Triagers, operations: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1010154 (Philippe-WMF) Plz to add me? I'm looking at the CA team's involvement and how we use this, and it seems best to set up a project, and proba...
[00:10:33] (CR) jenkins-bot: [V: -1] Split out the mysql config into a separate class. [puppet] - https://gerrit.wikimedia.org/r/188248 (owner: Andrew Bogott)
[00:13:29] (PS3) Andrew Bogott: Split out the mysql config into a separate class. [puppet] - https://gerrit.wikimedia.org/r/188248
[00:17:33] (CR) Andrew Bogott: [C: 2] Split out the mysql config into a separate class. [puppet] - https://gerrit.wikimedia.org/r/188248 (owner: Andrew Bogott)
[00:21:26] (PS1) Andrew Bogott: Remove some unneeded dependencies. [puppet] - https://gerrit.wikimedia.org/r/188257
[00:22:37] (CR) Andrew Bogott: [C: 2] Remove some unneeded dependencies. [puppet] - https://gerrit.wikimedia.org/r/188257 (owner: Andrew Bogott)
[00:26:06] ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1010241 (RobH)
[00:26:07] operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1010240 (RobH)
[00:26:09] operations, ops-codfw, hardware-requests: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1010242 (RobH)
[00:26:12] operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010244 (BBlack) I think you mean: ``` RewriteRule ^/(.*)$ https://office.wikimedia.org/$1 [R=301,L,NE] ``` Can someone test this somewhere and get back?
[00:27:29] operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#979090 (RobH) I had already installed rbf2001 via the now linked task https://phabricator.wikimedia.org/T86897. I'm not sure what this specific ticket is for, unless its for service imple...
[00:27:47] (PS1) BBlack: Revert "enable varnishncsa svcs at machine boot time" [puppet] - https://gerrit.wikimedia.org/r/188259
[00:27:53] (PS2) BBlack: Revert "enable varnishncsa svcs at machine boot time" [puppet] - https://gerrit.wikimedia.org/r/188259
[00:28:03] (CR) BBlack: [C: 2 V: 2] Revert "enable varnishncsa svcs at machine boot time" [puppet] - https://gerrit.wikimedia.org/r/188259 (owner: BBlack)
[00:28:30] (Abandoned) Andrew Bogott: Revert "Include a database on silver, for wikitech mediawiki." [puppet] - https://gerrit.wikimedia.org/r/188232 (owner: Andrew Bogott)
[00:32:20] ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1010280 (RobH) daniel's setup rbf2001 via the linked task for service implementation https://phabricator.wikimedia.org/T86898
[00:39:38] (PS1) Ottomata: Point varnishcsa multicast relay instances back at the socat relay on gadolinium. [puppet] - https://gerrit.wikimedia.org/r/188262
[00:39:51] (PS1) Andrew Bogott: Temporary hack: Turn off wikitech-static dump crons. [puppet] - https://gerrit.wikimedia.org/r/188263
[00:39:53] (PS1) Andrew Bogott: Add a db config to silver that resembles the one on virt1000 [puppet] - https://gerrit.wikimedia.org/r/188264
[00:44:06] (PS5) BBlack: Autodetect RSS patterns in interface-rps [puppet] - https://gerrit.wikimedia.org/r/188112
[00:44:08] (PS1) BBlack: interface perf for jessie cache nodes [puppet] - https://gerrit.wikimedia.org/r/188265
[00:44:33] (PS2) Ottomata: Point varnishcsa multicast relay instances back at the socat relay on gadolinium. [puppet] - https://gerrit.wikimedia.org/r/188262
[00:45:51] (CR) jenkins-bot: [V: -1] Add a db config to silver that resembles the one on virt1000 [puppet] - https://gerrit.wikimedia.org/r/188264 (owner: Andrew Bogott)
[00:47:00] (CR) Ottomata: [C: 2 V: 2] Point varnishcsa multicast relay instances back at the socat relay on gadolinium. [puppet] - https://gerrit.wikimedia.org/r/188262 (owner: Ottomata)
[00:47:10] ops-codfw, operations: reclaim rbf2002/WMF5833 back to spare, allocate WMF5845 as rbf2002 - https://phabricator.wikimedia.org/T88380#1010323 (RobH) NEW a:Papaul
[00:54:14] (PS2) Andrew Bogott: Temporary hack: Turn off wikitech-static dump crons. [puppet] - https://gerrit.wikimedia.org/r/188263
[00:54:16] (PS2) Andrew Bogott: Add a db config to silver that resembles the one on virt1000 [puppet] - https://gerrit.wikimedia.org/r/188264
[00:55:32] (CR) Andrew Bogott: [C: 2] Temporary hack: Turn off wikitech-static dump crons. [puppet] - https://gerrit.wikimedia.org/r/188263 (owner: Andrew Bogott)
[00:55:57] (PS1) RobH: changing the host allocation for rbf2002 [dns] - https://gerrit.wikimedia.org/r/188267
[00:56:36] (CR) RobH: [C: 2] changing the host allocation for rbf2002 [dns] - https://gerrit.wikimedia.org/r/188267 (owner: RobH)
[01:02:20] operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1010364 (Dzahn) @joe was the plan to install these on Debian already? or still Ubuntu?
[01:04:14] Jenkins/zuul offline?
[01:05:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[01:08:00] bd808, is logstash dead again?
[01:08:21] MaxSem: I don't think it has been un-dead for weeks
[01:08:49] heh
[01:09:09] still, it became stone dead 5 hours ago:)
[01:10:24] looks like no new index for 2015-02-03 yet
[01:10:36] I can poke things in a minute
[01:15:26] (PS2) BBlack: interface perf for jessie cache nodes [puppet] - https://gerrit.wikimedia.org/r/188265
[01:15:28] (PS1) BBlack: move bnx2x num_queues from lvs::balancer to interface::rps [puppet] - https://gerrit.wikimedia.org/r/188268
[01:16:08] (PS6) BBlack: Autodetect RSS patterns in interface-rps [puppet] - https://gerrit.wikimedia.org/r/188112
[01:16:18] (CR) BBlack: [C: 2 V: 2] Autodetect RSS patterns in interface-rps [puppet] - https://gerrit.wikimedia.org/r/188112 (owner: BBlack)
[01:16:19] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:16:30] (PS2) BBlack: move bnx2x num_queues from lvs::balancer to interface::rps [puppet] - https://gerrit.wikimedia.org/r/188268
[01:17:08] (CR) BBlack: [C: 2 V: 2] move bnx2x num_queues from lvs::balancer to interface::rps [puppet] - https://gerrit.wikimedia.org/r/188268 (owner: BBlack)
[01:17:36] (PS3) BBlack: interface perf for jessie cache nodes [puppet] - https://gerrit.wikimedia.org/r/188265
[01:18:37] (CR) BBlack: [C: 2] interface perf for jessie cache nodes [puppet] - https://gerrit.wikimedia.org/r/188265 (owner: BBlack)
[01:20:52] lovely, incoming icinga spam I think :p
[01:21:29] oh maybe not, I think it didn't affect precise
[01:23:20] (CR) Andrew Bogott: [C: 2] Add a db config to silver that resembles the one on virt1000 [puppet] - https://gerrit.wikimedia.org/r/188264 (owner: Andrew Bogott)
[01:28:40] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:33:11] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[01:33:50] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail
[01:35:05] (PS1) BBlack: remove trailing space from cmd if !rss_pattern [puppet] - https://gerrit.wikimedia.org/r/188273
[01:36:17] (CR) BBlack: [C: 2] remove trailing space from cmd if !rss_pattern [puppet] - https://gerrit.wikimedia.org/r/188273 (owner: BBlack)
[01:37:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:37:59] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[01:40:44] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010395 (Dzahn) wikitech-static is still in use, but totally separate from puppet, running on a rackspace instance. radium is the tor-relay.
[01:41:37] (PS1) Dzahn: redisdb: replace hardcoded eqiad with $site [puppet] - https://gerrit.wikimedia.org/r/188274 (https://phabricator.wikimedia.org/T86898)
[01:46:49] (PS1) Dzahn: mediawiki: replace hardcoded eqiad with $site [puppet] - https://gerrit.wikimedia.org/r/188275 (https://phabricator.wikimedia.org/T86894)
[01:47:58] (PS2) Dzahn: mediawiki: replace hardcoded eqiad with $site [puppet] - https://gerrit.wikimedia.org/r/188275 (https://phabricator.wikimedia.org/T86894)
[01:48:40] (PS3) Dzahn: mediawiki: replace hardcoded eqiad with $site [puppet] - https://gerrit.wikimedia.org/r/188275 (https://phabricator.wikimedia.org/T86894)
[01:49:39] (PS2) Dzahn: redisdb: replace hardcoded eqiad with $site [puppet] - https://gerrit.wikimedia.org/r/188274 (https://phabricator.wikimedia.org/T86898)
[01:49:41] (PS1) BBlack: switch cache nodes to perf cpufreq governor [puppet] - https://gerrit.wikimedia.org/r/188276
[01:49:59] (CR) BBlack: [C: 2 V: 2] switch cache nodes to perf cpufreq governor [puppet] - https://gerrit.wikimedia.org/r/188276 (owner: BBlack)
[01:51:30] !log restarted elasticsearch on logstash1003
[01:51:37] Logged the message, Master
[01:51:46] !log restarted logstash on logstash1001
[01:51:49] Logged the message, Master
[01:56:42] operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1010425 (Dzahn) @Joe how about master-slave replication of redis servers? i see the "# slaveof " construct in the config template. Is the goal to just replicate within a datacenter? so this should be an...
[01:57:36] YuviPanda|zzz: I disabled the nightly wikitech-static sync, and so far wikitech has survived the magic hour.
[01:57:45] No doubt it will go down the instant I close my laptop though
[01:58:54] what is that "oldnova" database?
[01:59:04] did you see that in the bug about mysql going down?
[01:59:27] mutante: I didn’t see it
[01:59:43] It’s likely a relic of another age, but also probably not a culprit in the collapse
[01:59:46] !log Manually created apifeatureusage-2015.02.02 and apifeatureusage-2015.02.03 indices in elasticsearch; cluster needs rolling restart for autocreate to work for these names
[01:59:56] Logged the message, Master
[01:59:58] !log depool cp1065 (text eqiad in pybal -> jessie)
[02:00:01] Logged the message, Master
[02:01:03] andrewbogott: last 2 comments on https://phabricator.wikimedia.org/T88256
[02:02:05] (PS1) BBlack: depool cp1065 from eqiad text backends [puppet] - https://gerrit.wikimedia.org/r/188278
[02:02:07] (PS1) BBlack: cp1065 -> jessie installer [puppet] - https://gerrit.wikimedia.org/r/188279
[02:02:19] PROBLEM - Disk space on silver is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%):
[02:02:34] andrewbogott: yay
[02:02:45] (CR) BBlack: [C: 2 V: 2] depool cp1065 from eqiad text backends [puppet] - https://gerrit.wikimedia.org/r/188278 (owner: BBlack)
[02:02:52] andrewbogott: I'm awake but laptop less for another 6h
[02:03:00] ok
[02:03:07] (CR) BBlack: [C: 2] cp1065 -> jessie installer [puppet] - https://gerrit.wikimedia.org/r/188279 (owner: BBlack)
[02:03:33] andrewbogott: thank you! :)
[02:03:49] looks like sean worked on the DB overnight, so he probably gets the actual credit.
[02:04:03] operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1010431 (Andrew) Aw, when the system didn't go down today I thought I might have fixed it with https://gerrit.wikimedia.org/r/#/c/188263/ but now I see it was most likely Sean's work. 'novaold' can almost certainly go....
[02:04:15] heh
[02:05:01] springle: I’m about to step away but will try to catch up with you regarding the wikitech db in a couple of hours.
[02:05:09] andrewbogott: dmesg said plenty of stuff besides db kills wikitech from time to time. reducing DB footprint was just to gain headroom
[02:05:14] np
[02:06:08] andrewbogott: also, i complained on T88311 :) but no hurry
[02:06:10] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100%
[02:07:14] springle: I maybe don’t know what I mean by ‘misc hosts’. I had the impression that we had some db boxes that were used for misc (non-production non-wikipedia-related) services...
[02:07:35] a few people (including mark) were advocating for us to move to that in order to get backups and redundancy and such.
[02:08:00] RECOVERY - Host cp1065 is UP: PING WARNING - Packet loss = 50%, RTA = 0.90 ms
[02:08:03] non-production and non-wikipedia-related are different things
[02:08:18] ok, fair enough :)
[02:08:35] yes we have those; i'm just a little concerned about letting wikitech in... so to speak :)
[02:08:47] are we no longer making any effort to keep it separate?
[02:09:06] I’m not proposing that the openstack dbs be hosted anywhere but virt1000
[02:09:13] only the wiki db
[02:09:24] I don’t think there are any security concerns there...
[02:09:29] oh ok
[02:09:33] * andrewbogott should never say things like that out loud
[02:09:35] that sounds nicer
[02:10:09] So, that wiki will still host OpenStackManager, but it will interact with labs only via the mw API
[02:10:15] well, and calls out to keystone, but that’s one-way
[02:10:23] btw, where can i find the password for wikitech-static
[02:10:30] mutante: it’s on iron
[02:10:31] for something unrelated
[02:10:40] I believe it’s in a file named ‘wikitech-static’ :)
[02:10:54] got it. must have been blind. thanks
[02:10:57] np
[02:11:19] OK, I’m out — will be back briefly in a couple hours.
[02:11:26] hm
[02:11:26] given the point of wikitech-static, why is the password for login stored in the production cluster? :/
[02:11:37] Krenair: you don’t need the password to /read/ it
[02:12:07] I know, but still... :/
[02:12:39] You are envisioning a very very terrible scenario which I am going to not think about right now :)
[02:12:39] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[02:16:33] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 03s)
[02:16:44] Logged the message, Master
[02:17:04] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010435 (Dzahn) i fixed the SSL settings on wikitech-static. SSLv3 is disabled and i used the same cipher settings as on regular wikitech. feel free to re-scan now
[02:17:39] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.014 second response time
[02:17:40] !log LocalisationUpdate completed (1.25wmf14) at 2015-02-03 02:16:37+00:00
[02:17:44] Logged the message, Master
[02:19:31] !log installing package upgrades on radium
[02:19:33] Logged the message, Master
[02:20:00] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.036 second response time
[02:20:07] /away zzzz
[02:21:11] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010436 (Dzahn) a:Dzahn
[02:25:35] Wikimedia-Logstash, hardware-requests, operations: Allocate temporary Elasticsearch nodes from spares pool for Logstash - https://phabricator.wikimedia.org/T87460#1010452 (Krinkle)
[02:30:14] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s)
[02:30:22] Logged the message, Master
[02:30:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:31:21] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-03 02:30:18+00:00
[02:31:24] Logged the message, Master
[02:35:57] (PS1) BBlack: Revert "depool cp1065 from eqiad text backends" [puppet] - https://gerrit.wikimedia.org/r/188285
[02:36:07] (CR) BBlack: [C: 2 V: 2] Revert "depool cp1065 from eqiad text backends" [puppet] - https://gerrit.wikimedia.org/r/188285 (owner: BBlack)
[02:37:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[02:42:55] !log cp1065 re-pooled in pybal
[02:43:03] Logged the message, Master
[02:43:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:55] (PS1) Ori.livneh: Update navtiming reporter for latest schema [puppet] - https://gerrit.wikimedia.org/r/188286
[02:49:16] !log deployed https://gerrit.wikimedia.org/r/#/c/187304/ (php-set X-Analytics header) to both production branches.
[02:49:23] Logged the message, Master
[02:49:39] bblack: fyi -- things look good.
[02:49:45] (re header)
[02:49:55] (PS2) Ori.livneh: Update navtiming reporter for latest schema [puppet] - https://gerrit.wikimedia.org/r/188286
[02:50:02] (CR) Ori.livneh: [C: 2 V: 2] Update navtiming reporter for latest schema [puppet] - https://gerrit.wikimedia.org/r/188286 (owner: Ori.livneh)
[02:50:49] awesome
[02:51:15] (CR) Krinkle: "https://meta.wikimedia.org/?oldid=10785754" [puppet] - https://gerrit.wikimedia.org/r/188286 (owner: Ori.livneh)
[02:52:46] we could actually strip the header from the response body; we're only inserting them so that varnishkafka logs them
[02:52:59] that's a micro-optimization though
[02:53:04] useful for debugging sometimes too though?
[02:53:28] not really. maybe if you want to surveil which wikipedia pages your employees visit :P
[02:53:50] I think there are easier ways :)
[03:28:39] MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1010498 (tstarling) NEW
[03:29:35] MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1010506 (tstarling)
[03:52:31] beta.wmflabs.org is down
[03:56:19] I can reach it in a browser right now
[03:59:51] weird. it's automatically changing to https
[03:59:56] on chrome
[04:00:09] working fine on FF
[04:06:09] RECOVERY - Disk space on silver is OK: DISK OK
[04:08:46] ha. forcehttps cookie was the culprit
[04:09:34] operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1010548 (Andrew) a:Andrew
[04:12:02] springle: how much work/time is it to import the db from virt1000/silver into a proper misc server?
[04:12:22] And then there’s the question of having the wiki actually look there for its data, which I don’t know how to do but which is probably easy
[04:22:02] andrewbogott: an hour downtime or so I guess
[04:22:50] would need to dump/reload. mostly small tables, except wikitech uses text
[04:23:31] Is this something you’d have the time and inclination to do in the near future? By default I’m going to just rsync the db files from virt1000 to silver, but we could just skip ahead to doing it right...
[04:23:57] Would it really be downtime? Or just read-only time?
[04:24:10] right, read only
[04:24:40] andrewbogott: we should probably clean up this novaold stuff
[04:25:07] Since I don’t know where it came from, it makes me nervous. Can you tell when it was last modified?
[04:25:11] which may (likely) mean it needs dumping regardless
[04:25:37] oh, you mean, dumping as a backup before we drop the tables?
[04:25:55] not with any certainty. virt1000 helpfully rotates mysqld error logs, so i don't know how far back things broke
[04:26:05] it certainly isn't being written to right now
[04:26:53] ‘novaold’ sounds like something I would’ve made during a transition and then forgotten about. Except, my mysql skills aren’t good enough to do that :)
[04:27:03] So maybe it was generated during a nova point upgrade...
[04:27:47] andrewbogott: the innodb data dictionary is out of sync with the data directory. something blew away *.ibd files, or did a weird sync or upgrade
[04:28:12] um… would rsyncing a datadir from another host do that? ‘cause I for sure did that
[04:28:19] to be safe, we should first dump and backup. then experiment with repair, or just reload afresh
[04:28:21] When we migrated from tampa
[04:29:03] idk how you did the rsync. if mysqld was still running when the rsync ran, it could have been the cause
[04:29:09] but *shrug*
[04:29:18] it's broken now, and needs fixing
[04:29:21] ok. So when you say ‘we should first dump and backup…’
[04:29:25] Is the ‘we’ you, or me? :D
[04:29:31] me i guess :)
[04:30:03] once we decide exactly which databases are going where
[04:30:44] ok, that shouldn’t be too hard
[04:31:59] springle: it’s obvious which are mediawiki and which are openstack, right? I think that’s where the break should be.
[04:32:18] ‘labswiki’ can go on a shared host, everything else should stay on virt1000
[04:32:27] what is labswiki_eqiad?
[04:32:59] oh, man, there’s a labswiki and a labswiki_eqiad?
[04:33:13] I am not the greatest housekeeper :(
[04:33:19] :)
[04:33:58] The live wikitech is clearly using a db called ‘labswiki’
[04:34:06] so, labswiki_eqiad must be a backup
[04:34:10] ok, actually, I know what it is --
[04:34:13] labswiki_eqiad has timestamps up to 20140319234939
[04:34:24] there was a second, fully-functioning wiki running on virt1000 at the same time as virt0
[04:34:45] so when we shutdown tampa I declared there to be one true wiki and shoved the old contents aside.
[04:34:49] So, that’s got to be what that is.
[04:35:20] and, March is when that happened, so, checks out.
[04:35:56] andrewbogott: will test the dump/reload time today. it won't block anything
[04:36:06] great.
[04:36:19] Shall I schedule some downtime for tomorrow night, or do you want to wait until you know more before I schedule?
[04:36:38] your night? now-ish + 24h?
[04:37:02] whenever suits you best.
[04:37:16] But, yeah, right now is a pretty slack time labswise
[04:37:19] probably ok
[04:37:19] and opswise for that matter
[04:38:05] So what will happen exactly? You’ll mark all the dbs as read-only, right? I wonder how nova will cope with that...
[04:38:11] and mediawiki.
[04:38:14] Any guesses?
[04:38:29] I guess I can just say “Things will be weird and maybe broken” and wait and see :)
[04:38:31] mediawiki has a read only setting
[04:38:39] no clue about everything else
[04:40:46] andrewbogott: i'd like to hear from Reedy or similar about whether labswiki should go on s3 or a m[123]. because it uses the text table heavily and not external storage, it may not suit s3. but it would be the first mediawiki to go on m[123]...
[04:41:25] springle: want me to start an email thread, or are you likely to still be online when he wakes up?
[04:41:33] presumably it has implications for how upgrades are handled too
[04:41:49] andrewbogott: start a ticket or thread. i'll be in and out
[04:42:15] ok. I think I will punt on the downtime notice for now — short notice is probably ok anyway.
[04:45:02] ok
[04:45:38] springle: ok, I updated T88311 and nagged Sam directly to respond. If you learn things from the dump and reload that’s probably the right place to note things as well.
[04:45:46] Thanks for working on this with so little warning.
[04:47:59] np
[04:49:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Feb 3 04:48:38 UTC 2015 (duration 48m 37s)
[04:49:47] Logged the message, Master
[04:52:36] (PS1) Glaisher: Fix wgMobileUrlTemplate for wikidatawiki on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/188300 (https://phabricator.wikimedia.org/T87440)
[05:09:29] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures
[05:14:41] !log wikitech virt1000 test db dump T88311
[05:14:47] Logged the message, Master
[05:26:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[05:41:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[05:52:59] (CR) Tim Landscheidt: "Just commenting out the cron jobs will not disable them. Did you remove them manually as well?" [puppet] - https://gerrit.wikimedia.org/r/188263 (owner: Andrew Bogott)
[05:55:58] operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010633 (Joe) a:Joe
[06:00:44] MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1010638 (Joe) Honestly I don't see the point in creating a godzilla 130 GB file every day. A correct way to tackle this is probably rotating the file more often than daily, and keep 7 days of retentio...
[06:25:54] MediaWiki-Core-Team, operations: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1010663 (tstarling) >>! In T88393#1010638, @Joe wrote: > Honestly I don't see the point in creating a godzilla 130 GB file every day. If we knew what HHVM server(s) were involved in T87645, that would...
[06:28:20] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:29] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:10] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:18] (CR) Andrew Bogott: "Um... yes (now) :)" [puppet] - https://gerrit.wikimedia.org/r/188263 (owner: Andrew Bogott)
[06:44:59] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:00] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:51:27] <_joe_> it's relieving to wake up and see passenger o'clock is still there
[06:56:50] (PS17) Gage: Strongswan: IPsec Puppet module [puppet] - https://gerrit.wikimedia.org/r/181742
[06:57:23] <_joe_> jgage: I'm gonna take a look at it today
[06:57:37] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010674 (Chmarkine) So is radium needed to fix as well?
[06:57:37] <_joe_> or did you test ipresolve thoroughly?
[06:59:04] _joe_ i mostly fixed it but i wasn't sure on syntax to get the ttl so i set it to 300s
[06:59:16] but i used wireshark to confirm that it's caching successfully
[06:59:19] <_joe_> ok, I'll take a look
[06:59:25] thanks amigo
[06:59:42] (CR) Gage: "Fixed most issues with ipresolve.rb except using actual TTL. Temporarily hardcoded to 300 seconds." [puppet] - https://gerrit.wikimedia.org/r/181742 (owner: Gage)
[06:59:46] <_joe_> I wrote that function in bits & pieces, I didn't expect it to work at all
[07:00:07] <_joe_> how broken was it?
[07:00:18] hehe pretty broken, i had fun stepping through it
[07:00:28] learned some ruby along the way
[07:01:13] <_joe_> good :)
[07:01:37] <_joe_> yeah the ttl is not passed along, obviously
[07:01:41] <_joe_> I'll fix that
[07:01:54] <_joe_> and also write tests
[07:02:12] awesome thanks
[07:02:26] currently it assumes only one A/AAAA record, which is fine for us
[07:02:35] but for reuse maybe we should use getresources instead of getresource
[07:02:48] <_joe_> yes, I decided it was the best way to do that
[07:02:54] also mark advised that we should probably serve from the stale cache if there's no dns response
[07:02:55] <_joe_> one can always write a wrapper
[07:03:06] fair enough
[07:03:16] <_joe_> could be a good idea
[07:05:48] <_joe_> gee my todo list for today is already horribly long
[07:34:10] (PS2) Giuseppe Lavagetto: hiera: actively look up the role hierarchy instead of the standard one [puppet] - https://gerrit.wikimedia.org/r/187312
[07:34:20] (CR) Giuseppe Lavagetto: [C: 2 V: 2] hiera: actively look up the role hierarchy instead of the standard one [puppet] - https://gerrit.wikimedia.org/r/187312 (owner: Giuseppe Lavagetto)
[07:44:11] (PS1) Giuseppe Lavagetto: move the codfw role configs to the correct place [puppet] - https://gerrit.wikimedia.org/r/188315
[07:44:13] (CR) Nikerabbit: [C: 1] Undeploy Solarium [mediawiki-config] - https://gerrit.wikimedia.org/r/188217 (owner: Reedy)
[07:45:09] (CR) Giuseppe Lavagetto: [C: 2] move the codfw role configs to the correct place [puppet] - https://gerrit.wikimedia.org/r/188315 (owner: Giuseppe Lavagetto)
[07:50:44] <_joe_> mmmh no wikibugs
[08:10:21] (PS1) Faidon Liambotis: reprepro: update signing key for Cassandra's repo [puppet] - https://gerrit.wikimedia.org/r/188318
[08:10:23] (PS1) Faidon Liambotis: reprepro: add updates from torproject.org [puppet] - https://gerrit.wikimedia.org/r/188319
[08:10:47] !log wikitech mysql restart to fix novaold errors
[08:10:54] Logged the message, Master
[08:11:16] (CR) Faidon Liambotis: [C: 2] reprepro: update signing key for Cassandra's repo [puppet] - https://gerrit.wikimedia.org/r/188318 (owner: Faidon Liambotis)
[08:11:29] (CR) Faidon Liambotis: [C: 2] reprepro: add updates from torproject.org [puppet] - https://gerrit.wikimedia.org/r/188319 (owner: Faidon Liambotis)
[08:12:47] operations, Labs: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1010779 (Springle) novaold errors have been fixed.
[08:14:24] !log radium: upgrade tor to the latest torproject.org version
[08:14:28] Logged the message, Master
[08:15:17] (PS1) Ori.livneh: Add log group for T87645 [mediawiki-config] - https://gerrit.wikimedia.org/r/188322
[08:15:54] (CR) Ori.livneh: [C: 2] Add log group for T87645 [mediawiki-config] - https://gerrit.wikimedia.org/r/188322 (owner: Ori.livneh)
[08:17:40] !log ori Synchronized php-1.25wmf15/includes/EditPage.php: Id376f9e75: Hack for T87645, since maybe it is still happening (duration: 00m 05s)
[08:17:46] Logged the message, Master
[08:18:16] (CR) Ori.livneh: [V: 2] Add log group for T87645 [mediawiki-config] - https://gerrit.wikimedia.org/r/188322 (owner: Ori.livneh)
[08:19:06] !log ori Synchronized php-1.25wmf14/includes/EditPage.php: Id376f9e75: Hack for T87645, since maybe it is still happening (duration: 00m 07s)
[08:19:10] Logged the message, Master
[08:20:19] !log ori Synchronized wmf-config/InitialiseSettings.php: Ie7b32e3d8: Add log group for T87645 (duration: 00m 05s)
[08:20:22] Logged the message, Master
[08:20:59] _joe_: the www-data patches I +1'd the other day seem like noops :D should we merge them?
[08:21:09] _joe_: I can test on beta if you'd like
[08:21:18] I already have a mediawiki instance
[08:23:05] <_joe_> YuviPanda: yeah that was my idea
[08:23:19] _joe_: alright then. let me do that now?
[08:23:24] <_joe_> if you want, go on now, I'm doing something else ATM
[08:23:30] deployment-mediawiki04. it's out of rotation and has the same roles as prod applied
[08:23:34] _joe_: cool.
[08:23:39] _joe_: I'll report back with diffs, if any
[08:23:43] (shouldn't be any)
[08:24:31] <_joe_> I guess so
[08:24:43] <_joe_> this will also allow us to test the migration
[08:25:20] _joe_: yeah
[08:27:42] <_joe_> un-fucking-believable. The resolver class in ruby 1.8.7 doesn't save the ttl anywhere
[08:27:45] <_joe_> sigh
[08:28:07] (CR) Yuvipanda: [C: -1] mediawiki: allow using a different web user than apache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/187259 (owner: Giuseppe Lavagetto)
[08:28:13] _joe_: very believable, actually :)
[08:28:16] * YuviPanda fixes up patch
[08:29:37] (PS3) Yuvipanda: mediawiki: allow using a different web user than apache [puppet] - https://gerrit.wikimedia.org/r/187259 (owner: Giuseppe Lavagetto)
[08:29:48] (PS4) Yuvipanda: mediawiki: allow using a different web user than apache [puppet] - https://gerrit.wikimedia.org/r/187259 (owner: Giuseppe Lavagetto)
[08:29:56] (PS2) Yuvipanda: labstore: do not explicitly declare the apache user existence [puppet] - https://gerrit.wikimedia.org/r/187686 (owner: Giuseppe Lavagetto)
[08:30:03] (PS2) Yuvipanda: maintenance: allow choosing the web user [puppet] - https://gerrit.wikimedia.org/r/187687 (owner: Giuseppe Lavagetto)
[08:30:10] (PS2) Yuvipanda: beta: allow defining the web user. [puppet] - https://gerrit.wikimedia.org/r/187688 (owner: Giuseppe Lavagetto)
[08:30:35] <_joe_> YuviPanda: ouch, good catch, I was writing ruby there :)
[08:30:42] _joe_: :D
[08:32:23] _joe_: yup, they're noops with that fixed :)
[08:41:20] operations: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#1010815 (faidon) Open>Resolved radium is the Tor relay, not an HTTP server.
It was due for a tor version upgrade anyway though, so I did this today.
[08:48:30] <_joe_> YuviPanda: good to know! Would you comment on the tickets?
[08:48:45] _joe_: yeah, am doing so now
[08:49:13] _joe_: jenkins hiccuped, am waiting another 5mins to see if that was with the broken patch or the new one
[08:49:19] will comment once that's done
[08:49:58] (PS2) Reedy: Undeploy Solarium [mediawiki-config] - https://gerrit.wikimedia.org/r/188217
[08:50:04] (CR) Reedy: [C: 2] Undeploy Solarium [mediawiki-config] - https://gerrit.wikimedia.org/r/188217 (owner: Reedy)
[08:50:17] (Merged) jenkins-bot: Undeploy Solarium [mediawiki-config] - https://gerrit.wikimedia.org/r/188217 (owner: Reedy)
[08:50:44] !log reedy Synchronized wmf-config/: Bye bye Solarium (duration: 00m 06s)
[08:50:49] Logged the message, Master
[08:57:53] operations: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1010865 (faidon) The response from our GlobalSign rep is that they don't have a firm timeline and can't commit to a Q1 rollout yet.
[08:59:22] (PS1) Giuseppe Lavagetto: Use dh-exec to properly rename ini file for fastcgi [debs/hhvm] - https://gerrit.wikimedia.org/r/188332
[08:59:50] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:08:15] greetings
[09:08:37] hi :)
[09:09:33] hey paravoid ! how's the jetlag?
[09:09:51] no jetlag anymore
[09:09:53] :)
[09:09:55] that was quick
[09:10:02] having to wake up at 8am for three days helped
[09:10:09] and stay up all day
[09:10:14] with no option to go crash on a bed
[09:12:30] <_joe_> it was brutal, yes
[09:12:56] yeah :)
[09:14:27] haha nice, supposedly west->east is easier (?)
[09:14:42] depends on the person
[09:14:48] for me it's not
[09:14:59] <_joe_> for me it is
[09:15:03] <_joe_> but i need a good rest
[09:15:10] <_joe_> which I didn't get until today
[09:15:14] <_joe_> https://pbs.twimg.com/media/B816Ac9CcAA-uAn.jpg lol
[09:15:28] Beta-Cluster, operations: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076#1010888 (yuvipanda)
[09:16:00] (CR) Filippo Giunchedi: [C: 1] fix another 28 puppet linter warnings [puppet] - https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) (owner: Dzahn)
[09:17:39] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[09:18:14] (PS3) Yuvipanda: maintenance: allow choosing the web user [puppet] - https://gerrit.wikimedia.org/r/187687 (https://phabricator.wikimedia.org/T78076) (owner: Giuseppe Lavagetto)
[09:18:16] (PS3) Yuvipanda: labstore: do not explicitly declare the apache user existence [puppet] - https://gerrit.wikimedia.org/r/187686 (https://phabricator.wikimedia.org/T78076) (owner: Giuseppe Lavagetto)
[09:18:18] (PS5) Yuvipanda: mediawiki: allow using a different web user than apache [puppet] - https://gerrit.wikimedia.org/r/187259 (https://phabricator.wikimedia.org/T78076) (owner: Giuseppe Lavagetto)
[09:18:20] (PS3) Yuvipanda: beta: allow defining the web user. [puppet] - https://gerrit.wikimedia.org/r/187688 (https://phabricator.wikimedia.org/T78076) (owner: Giuseppe Lavagetto)
[09:18:22] good morning !
[09:18:47] operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1010911 (Aklapper)
[09:19:10] Beta-Cluster, operations: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076#1010912 (yuvipanda) Retitled to point to current solution.
I tested the patches (with a fix) on deployment-mediawiki04, and found that they were noops (yay!)
[09:27:42] Phabricator, Project-Creators, Triagers, operations: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1010928 (Aklapper) I've added Philippe-WMF.
[09:33:09] operations: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1010940 (hashar) I guess the Gerrit user `novaadmin ` has been made to watch the operations/puppet.git repository which would cause the spam....
[09:40:15] operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010957 (Joe) The line in @BBlack comment should work; I am still thinking of possible security implications of using NE in these rewrites, but there shouldn't be any
[09:40:55] (PS2) Yuvipanda: labs_vmbuilder: Don't setup lvm based swap [puppet] - https://gerrit.wikimedia.org/r/186125
[09:41:03] <_joe_> wat?
[09:41:08] <_joe_> lvm based swap
[09:41:16] <_joe_> what could possibly go wrong
[09:41:22] _joe_: :) *don't* :)
[09:41:30] (CR) Yuvipanda: [C: 2 V: 2] labs_vmbuilder: Don't setup lvm based swap [puppet] - https://gerrit.wikimedia.org/r/186125 (owner: Yuvipanda)
[09:41:32] <_joe_> ok :)
[09:41:49] _joe_: but yeah, am trying to build new images with the / issue 'fixed'
[09:41:57] what's wrong with swap on lvm in general?
[09:42:20] godog: I'm not sure at all, but in this case we also had non-lvm swap setup...
[09:42:30] <_joe_> godog: it doesn't make any sense IMO
[09:43:01] <_joe_> it's an additional level of complexity, also I remember it causing race conditions years ago
[09:43:08] <_joe_> it probably changed though
[09:43:21] <_joe_> godog: the only case where it makes sense is in FDE
[09:43:30] <_joe_> where you don't have any other options basically
[09:43:52] it has always worked fine for me tbh
[09:43:58] <_joe_> for anything else I don't expect you to need to resize your swap
[09:44:22] <_joe_> (I don't use swap a lot, too)
[09:44:39] heh that reminds me every time I look at partman setup I wonder if we want more of that being the same across the fleet
[09:45:06] <_joe_> I think my only systems with more than a tiny swap have been laptops lately
[09:46:19] <_joe_> YuviPanda: thinking better about this, do we even want labs hosts to have a swap?
[09:46:30] <_joe_> it will make the overall i/o skyrocket
[09:46:44] <_joe_> labs VMs I mean
[09:48:01] <_joe_> or maybe I am missing something, I know almost nothing about openstack tbh
[09:48:59] _joe_: yeah, my thinking is to agree, but I'm going to wait for andrewbogott_afk to come back before doing that
[09:49:57] hmm, building a new image seems to take forever
[09:50:31] <_joe_> godog: do we have a procedure for ES rolling restarts on wikitech? If not it may be a good idea to write one :)
[09:51:24] operations: plan workflow for blocked on ops patches - https://phabricator.wikimedia.org/T88315#1010981 (Aklapper) Note that a "Blocked-on-Operations" tag exists in Phab, and if this task is not entirely ops-team specific this might also be a shared topic for "Team-Practices".
[09:51:55] <_joe_> yes it is there, maybe a bit sparse
[09:52:03] <_joe_> https://wikitech.wikimedia.org/wiki/Search#Rolling_restarts
[09:52:52] ops-eqiad, operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#1010982 (fgiunchedi) I think we want to keep them as spares unless we have a good reason not to (e.g.
lack of rack space, cleanup, etc) since they are fairly powerful but out of warranty
[09:53:27] _joe_: yep that page essentially together with the tool chad wrote
[09:53:57] <_joe_> godog: maybe something about verifying everything is fine after a restart could be added
[09:54:09] <_joe_> like, operational steps to verify everything is fine
[09:55:15] <_joe_> I see there is a check in the script, but I think some additional details could be good to have
[09:56:05] indeed, something good to have, folks will be online in a few hours so we can take a look
[09:57:16] (PS1) Yuvipanda: beta: Make alerting period for cherry-picked changes 48h [puppet] - https://gerrit.wikimedia.org/r/188340
[10:01:22] Multimedia, operations: recurring http 500 errors when generating thumbnails - https://phabricator.wikimedia.org/T88412#1010996 (fgiunchedi) CCing #operations
[10:10:26] (PS2) Yuvipanda: beta: Make alerting period for cherry-picked changes 48h [puppet] - https://gerrit.wikimedia.org/r/188340
[10:10:49] (CR) Yuvipanda: [C: 2 V: 2] beta: Make alerting period for cherry-picked changes 48h [puppet] - https://gerrit.wikimedia.org/r/188340 (owner: Yuvipanda)
[10:11:03] (CR) Filippo Giunchedi: "thanks Ori!
I think we're fine with going for trusty-only anyway, I'll merge this after the move to graphite1001 (note that gdash will sta" [puppet] - https://gerrit.wikimedia.org/r/188069 (owner: Ori.livneh)
[10:11:57] (PS3) Filippo Giunchedi: Make gdash's uWSGI config.ru Ruby 1.9-compatible [puppet] - https://gerrit.wikimedia.org/r/188069 (https://phabricator.wikimedia.org/T85909) (owner: Ori.livneh)
[10:12:57] (PS3) Filippo Giunchedi: graphite: format /var/lib/carbon [puppet] - https://gerrit.wikimedia.org/r/187690 (https://phabricator.wikimedia.org/T85909)
[10:13:02] (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: format /var/lib/carbon [puppet] - https://gerrit.wikimedia.org/r/187690 (https://phabricator.wikimedia.org/T85909) (owner: Filippo Giunchedi)
[10:27:27] RESTBase, operations: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1011033 (fgiunchedi) a:fgiunchedi sure I'll take a look
[11:18:36] (PS1) Raimond Spekking: Fix config for elwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/188350
[11:20:23] (CR) Glaisher: "This will be fixed with I35334f38ec3eb99a435930714f2ede99703fa4f9" [mediawiki-config] - https://gerrit.wikimedia.org/r/188350 (owner: Raimond Spekking)
[11:22:10] (PS4) Glaisher: Standardize the name of interface editor group [mediawiki-config] - https://gerrit.wikimedia.org/r/186593 (https://phabricator.wikimedia.org/T85731)
[11:40:17] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01
[11:40:31] paravoid: I finally sat down, read the docs, and built ubuntu images with unified / :D
[11:40:35] and they work!
[11:40:48] ;)
[11:40:51] paravoid: I'll make them default after consulting with andrewbogott_afk.
[11:40:51] nice :)
[11:41:03] paravoid: I'm wondering - if / gets full, will I still be able to log in?
[11:41:08] via ssh, that is?
[11:41:15] yes
[11:41:22] right
[11:41:29] <_joe_> oh my I hate ruby, but I hate rspec more
[11:41:41] let me build the precise image now
[11:41:47] precise?
[11:41:52] can't we just deprecate precise instances?
[11:41:54] <_joe_> (I would've read the docs, but "the documentation is a work in progress at the moment")
[11:42:18] <_joe_> akosiaris: around?
[11:42:22] paravoid: nope, because several prod things are still precise
[11:42:32] the puppetmaster, salt master, videoscalers...
[11:42:38] and I'm pretty sure lots more
[11:42:55] RESTBase, operations: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1011124 (fgiunchedi) so from the referenced link we'd need: * ship metrics-graphite jar from http://search.maven.org/#artifactdetails|com.yammer.metrics|metrics-graphite|2.2.0|jar in a debian package (separate from ca...
[11:50:26] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[12:02:40] (PS18) Giuseppe Lavagetto: Strongswan: IPsec Puppet module [puppet] - https://gerrit.wikimedia.org/r/181742 (owner: Gage)
[12:22:53] operations: git::clone makes changed files root-only readable - https://phabricator.wikimedia.org/T86527#1011151 (hashar) The git::clone has learned the umask parameter with https://gerrit.wikimedia.org/r/#/c/187331/ 'fix git::clone umask issues'. Has been filed as T87843.
[12:23:25] operations: git::clone makes changed files root-only readable - https://phabricator.wikimedia.org/T86527#1011153 (hashar)
[12:25:06] paravoid: hmm, we're building the new images with a 20G /
[12:25:10] should we make it 30?
[12:25:12] * YuviPanda ponders
[12:25:26] I guess we can always make new larger images if we end up needing them
[12:25:42] and people can lvm mount things if needed
[12:27:03] andrewbogott_afk: btw, nice docs on building OS images! \o/
[12:31:42] <_joe_> YuviPanda: are you preserving our beloved /var of 2 G?
[12:31:52] <_joe_> I bet not
[12:32:03] <_joe_> you can't just part from tradition like that
[12:32:21] <_joe_> it's a hidden scheme we had to make people feel miserable
[12:32:30] _joe_: :P I've killed them allll!
[12:32:33] and verified it works
[12:32:38] so only question is 20G / or 30G /
[12:32:41] but our smallest image is 20G
[12:32:42] so...
[12:32:44] maybe 20G /?
[12:32:50] <_joe_> mmmh ok
[12:32:59] <_joe_> I'd tend to say 30 G
[12:33:10] <_joe_> disk space is the cheapest commodity we have I guess
[12:33:16] do a df across the whole fleet with salt
[12:33:19] <_joe_> but I have no idea on the consequences for labs
[12:33:39] and try to figure out the distribution of used space across the (labs) fleet
[12:33:58] <_joe_> paravoid: hoping nobody has a hadoop fs mounted via fuse
[12:34:31] NFS /home is failing to mount in new instances.
[12:35:28] <_joe_> how?
[12:36:04] the project does not provide one
[12:36:17] another fail tbh. The project does not provide one in wikitech
[12:36:22] but puppet tries to mount it :-(
[12:36:30] <_joe_> ahah
[12:36:34] <_joe_> nice!
[12:36:37] wait, what?
[12:36:43] I don't understand
[12:37:04] paravoid: so, it is possible to have a cluster wide /home per labs project
[12:37:14] it is a checkbox in some page in wikitech
[12:37:19] <_joe_> so, the project is not set to use nfs homes, but puppet tries anyways?
[12:37:26] <_joe_> why is that?
[12:37:36] but you can uncheck it. The result is puppet trying to mount it and failing
[12:37:39] <_joe_> doesn't that checkbox set a top-scope puppet variable?
[12:37:55] <_joe_> and how is that checkbox applied on the nfs hosts?
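A minimal sketch of the "df across the fleet, then look at the distribution" idea from 12:33, to help weigh a 20G against a 30G root. It assumes each host's `df -P /` output has already been collected as text by some means (the collection step via salt is not shown); the host names, the `headroom` fraction and both function names are hypothetical.

```python
from typing import Dict, List


def used_gib_from_df(df_output: str) -> float:
    """Return used space in GiB from `df -P` output (1024-byte blocks)."""
    lines = [line for line in df_output.strip().splitlines() if line.strip()]
    # POSIX format: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
    fields = lines[-1].split()
    return int(fields[2]) / (1024 * 1024)


def hosts_fitting(usage_gib: Dict[str, float], root_size_gib: float,
                  headroom: float = 0.8) -> List[str]:
    """List hosts whose current usage fits within `headroom` of a root size."""
    limit = root_size_gib * headroom
    return sorted(host for host, used in usage_gib.items() if used <= limit)


if __name__ == "__main__":
    sample = (
        "Filesystem     1024-blocks    Used Available Capacity Mounted on\n"
        "/dev/vda1         20511356 5242880  15268476      26% /\n"
    )
    # Hypothetical hosts; only the first value is parsed from real-looking output.
    usage = {"host-a": used_gib_from_df(sample), "host-b": 18.0, "host-c": 25.0}
    for size in (20, 30):
        print(size, hosts_fitting(usage, size))
```

Comparing the two lists shows which instances would only be comfortable on the larger root, which is the question the 20G / 30G exchange leaves open.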
[12:38:05] let me recheck cause I don't remember the issue well enough
[12:38:46] Error: /Stage[main]/Role::Labs::Instance/Mount[/home]: Could not evaluate: Execution of '/bin/mount /home' returned 32: mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/akosiaristests/home failed, reason given by server:
[12:38:46] No such file or directory
[12:38:56] <_joe_> ahahahahahah
[12:39:00] <_joe_> ok then
[12:39:06] <_joe_> lemme check puppet
[12:39:32] https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=configureproject&projectname=akosiaristests
[12:39:43] so the checkbox is unchecked in that page
[12:39:51] <_joe_> ok, but
[12:40:02] <_joe_> unchecking that box should set some variable in ldap
[12:40:11] <_joe_> that we can use in the puppet manifests
[12:40:17] <_joe_> this should be fixed in OSM
[12:40:28] yup
[12:40:29] <_joe_> right now the mount is unconditional
[12:40:51] I'll create a phab ticket for it
[12:42:26] mark, bblack: pybal on lvs1004 is not running, did you stop it by any chance?
[12:42:42] i didn't
[12:42:59] interesting
[12:47:30] _joe_: akosiaris but we default the check box to on
[12:47:41] And I am getting errors on deployment-prep
[12:47:47] And godog got it somewhere too
[12:49:28] YuviPanda: yeah I know.... labstore.svc.eqiad.wmnet is not exactly stable
[12:49:49] YuviPanda: I just remembered about that issue as well and I said it
[12:50:08] right
[12:50:16] I dunno if I should investigate or just be 'meh'
[12:57:32] <_joe__> mmmh tor failure
[12:57:35] <_joe__> :)
[12:58:02] <_joe_> if someone wrote me in the last 5 minutes, chances are I won't read those messages
[13:00:49] pretty quiet here
[13:01:14] _joe_: so does the latest hhvm read fcgi/php.ini or php.ini?
[13:01:38] I think we modified the latter, but perhaps it reads the first one and so we were testing with the same config all the time
[13:02:12] <_joe_> Nikerabbit: look at the startup script, if in doubt
[13:02:26] <_joe_> btw I'm uploading a new package version right now :)
[13:04:03] could it be that when I reinstalled hhvm it installed the init.d script which takes precedence over the upstart script?
[13:05:13] <_joe_> no I don't think so
[13:07:10] okay... then it's still a mystery why I can't enable admin server
[13:15:51] <_joe_> :/
[13:20:43] curiously enough it works if I specify -vAdminServer.Port=9001 on the init script itself
[13:22:31] ah... my config has been wrong all along
[13:32:22] That's comforting
[13:32:59] I swear that I checked that I had used the correct config key multiple times
[13:37:01] * _joe_ lunch
[13:37:48] andrewbogott_afk: so, precise and trusty both work with the new unified root! \o/ I'm wondering what kind of disk capacity we have - was thinking of making the minimum be 30G
[13:41:43] Phabricator, operations: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1011256 (hashar)
[13:41:57] Phabricator, operations: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#777606 (hashar)
[13:43:39] Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#1011259 (hashar)
[13:44:38] Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (hashar)
[13:44:41] Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1011265 (hashar)
[13:45:08] Ops-Access-Requests: Create shell access for Zeljko - RelEng rights -
https://phabricator.wikimedia.org/T87597#994821 (hashar)
[13:45:09] Continuous-Integration, Ops-Access-Requests: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#957717 (hashar)
[13:46:22] akosiaris: considering you wrote the ferm stuff, think you can help out with https://phabricator.wikimedia.org/T72076#1006644?
[13:48:01] hmm, so my comment on that ticket never made it...
[13:48:25] doesn't really matter, it was in the spirit of "yeah but not before Tuesday" anyway
[13:49:39] akosiaris: haha :)
[13:49:40] akosiaris: ok
[13:57:56] operations: Puppet failures in labs if "Share home directories across instances" or "Create shared project storage" are unchecked - https://phabricator.wikimedia.org/T88420#1011303 (akosiaris) NEW
[14:15:54] Beta-Cluster, operations: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1011343 (yuvipanda)
[14:18:54] operations: Migrate racktables to servermon - https://phabricator.wikimedia.org/T88424#1011348 (akosiaris) NEW a:akosiaris
[14:19:06] <_joe__> akosiaris: priority low for that nfs fail ticket?
[14:19:13] <_joe__> I'd have used UBN!
[14:19:16] <_joe__> :P
[14:19:29] <_joe__> like "it's broken since forever and it's embarrassing FFS"
[14:19:54] not to defend status quo, but it's 'broken broken' only since yesterday
[14:19:58] before that a reboot fixed it
[14:20:36] operations: Migrate racktables to servermon - https://phabricator.wikimedia.org/T88424#1011357 (akosiaris) T84001 listed various other tools as well. Should the effort fail, we should re-evaluate that list
[14:20:45] I could take a look but sooooo maaaannnnnyyyyytttthhingss to do
[14:21:17] <_joe__> YuviPanda: it's not your duty I think
[14:21:47] heh, that's one way to look at it, yeah.
[14:22:56] <_joe__> but well, whatever.
I'll raise that priority the moment it breaks my workflow
[14:23:03] :)
[14:26:01] (CR) Alexandros Kosiaris: [C: 2] "I kind of hate the inconsistency that we have connected a port that is not the first motherboard embedded port but rather the first PCI on" [puppet] - https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[14:26:10] (PS5) Alexandros Kosiaris: add IPv6 interface to dataset1001 (eth2) [puppet] - https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[14:26:16] (CR) Alexandros Kosiaris: [C: 2] add IPv6 interface to dataset1001 (eth2) [puppet] - https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[14:30:51] operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1011374 (akosiaris) NEW
[14:31:45] operations: Migrate racktables to servermon - https://phabricator.wikimedia.org/T88424#1011383 (akosiaris)
[14:31:46] operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1011384 (akosiaris)
[14:33:55] (CR) Hashar: [C: -1] fix another 28 puppet linter warnings (4 comments) [puppet] - https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) (owner: Dzahn)
[14:34:14] (PS5) Hashar: fix another 28 puppet linter warnings [puppet] - https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) (owner: Dzahn)
[14:34:29] (CR) Hashar: [C: 1] fix another 28 puppet linter warnings [puppet] - https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) (owner: Dzahn)
[14:37:54] ops-requests, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1011427 (hashar) NEW
[14:54:39] (CR) Alexandros Kosiaris: [C: 2] cxserver: enable no->nn
language pair [puppet] - https://gerrit.wikimedia.org/r/186522 (https://phabricator.wikimedia.org/T76674) (owner: KartikMistry)
[14:55:12] <_joe_> !log uploaded a new hhvm package version, deploying to testwiki and beta
[14:55:22] Logged the message, Master
[15:03:23] (PS1) Hashar: Fix linter paths discovery [puppet] - https://gerrit.wikimedia.org/r/188373
[15:10:12] (PS1) Steinsplitter: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867)
[15:13:03] (PS2) Steinsplitter: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867)
[15:16:16] (PS1) Hashar: Adjust our linters to use puppet-lint 1.1.0 [puppet] - https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T88430)
[15:19:11] operations, Continuous-Integration, ops-requests: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1011574 (hashar) If the approach in https://gerrit.wikimedia.org/r/#/c/188375/ is suitable, we can add a Jenkins job that would run `b...
[15:28:06] (PS1) Yuvipanda: snapshot: Remove unused files from lucid era [puppet] - https://gerrit.wikimedia.org/r/188376
[15:28:18] what happened to that "no 3rd party repositories" rule again? :)
[15:28:19] anyone wanna +2? ^
[15:28:29] this is just one reason why we need isolated CI :)
[15:29:08] hmm, at current staffing levels I wonder if we need one more OpenStack person to be able to support isolated CI
[15:29:23] why?
[15:29:37] not sure it even needs to be openstack
[15:29:39] andrew is going to be busy doing Horizon / Designate stuff, and I'll probably be doing beta / dumps / mw* / tools stuff.
[15:29:49] mark: well, hashar has settled on it being openstack from what I understand
[15:29:52] this is next quarter
[15:30:00] nothing is settled
[15:30:07] that discussion is ongoing this quarter
[15:30:33] mark: from hashar's email 2 weeks ago
[15:30:41] > An OpenStack cloud and KVM are non negotiable prerequisites unless we hire a couple engineers.
[15:30:42] yeah I know
[15:30:47] but that doesn't work that way
[15:30:50] ah :)
[15:31:14] but if we get some more folks to help we can use whatever else we want :]
[15:31:27] especially if you use openstack you'd need help
[15:31:29] :)
[15:31:54] right, so to rephrase, *if* we end up using OpenStack for it, I strongly feel that the current labs team isn't resourced enough to provide openstack help.
[15:32:05] that may be right
[15:33:17] :)
[15:34:25] my plan is to put yuvi in an openstack cloud and scale that service horizontally that way
[15:34:57] * ^d gets out the yuvi cloning machine
[15:36:01] :D
[15:36:54] <_joe_> I think openstack is a bad idea for doing segregation of builds
[15:37:09] <_joe_> but I am always the naysayer
[15:38:25] ^d: from https://etherpad.wikimedia.org/p/staging-machines, I personally think 1-7, fully puppetized with same code as prod is something we can do perhaps by end of the month?
[15:39:07] make that 1-8
[15:39:08] <_joe_> !log installing a new package to canary servers
[15:39:14] Logged the message, Master
[15:40:28] <^d> 1-4, 8 sure.
[15:40:40] <^d> Don't 5-7 depend on redis and memcached though?
[15:41:30] ^d: alright, 1-10 now :)
[15:41:36] my plan is to put yuvi in an openstack cloud and scale that service horizontally that way
[15:41:44] oops
[15:42:02] i started by cloning irc lines I guess
[15:42:28] is 'openstack cloud' a code name for one of those old school torture devices that 'stretch' you by pulling you in opposite directions?
[15:43:02] no that's "wikimedia"
[15:43:05] <^d> I guess that scales a person horizontally.
[15:43:23] haha
[15:43:50] <^d> Or vertically, if they stand back up
[15:43:54] <^d> It's useful that way
[15:44:04] horizontal scaling is accomplished by living near great restaurants
[15:44:07] mark: I'm looking through our dumps stuff. last commit seems to be 2012.
[15:44:21] or maybe I'm looking in wrong places
[15:44:24] * YuviPanda hunts around more
[15:44:40] Author: Tomasz Finc
[15:44:41] wow! :)
[15:46:31] <^d> YuviPanda: I'd say 1-6 for sure by end of month, 7-10 as a stretch goal.
[15:46:38] <^d> Seems doable.
[15:47:37] ^d: really? most of the things in 1-6 are already well puppetized, and so is 7...
[15:47:39] 1-7 maybe?
[15:47:57] but if you think 1-6 is what we should aim for, 1-6 is what we aim for :)
[15:49:46] ^d: also, how much redundancy are we gonna have?
[15:49:50] for memcached, for example
[15:49:51] and redis too
[15:49:53] one? two?
[15:50:43] * anomie sees nothing for SWAT this morning
[15:52:40] <^d> YuviPanda: We'll get done what we get done :)
[15:52:48] :)
[15:52:51] <^d> No idea on how redundant. I guess follow existing beta's example in some places.
[15:53:18] hmm, shouldn't we follow prod's?
[15:54:04] <^d> YuviPanda: Both!
[15:54:09] YuviPanda: when we migrated from tamp to eqiad a year ago, it was almost trivial to reproduce the beta cluster in the new DC :D
[15:54:19] :)
[15:54:29] hashar: right, but that was with ::beta roles and realm branching
[15:54:41] YuviPanda: just spawned a bunch of VM, had puppet finish the run, apply the relevant role, add new domain / public IP, amend puppet patches to adjust the new cluster name and IP address. {done} sort of
[15:54:50] :)
[15:54:51] the 'sort of' is the problem :)
[15:55:04] I think it took me 2 weeks overall with lot of help from andrew B and marc-andré since eqiad instances were slightly different
[15:55:17] I think for staging, the only realm branching we should do is lvs related ones.
[15:55:18] you will get some oddities such as ton of code expecting beta.wmflabs.org
[15:55:31] yes, they should all be abstracted into hiera
[15:55:35] and there is no LVS in labs :]
[15:56:02] yeah, this might take more than 2 weeks :)
[15:56:07] <^d> hashar: This project isn't about replicating beta as it is building a new staging more like prod :)
[15:56:07] also if we were to create a new one, it would be nice to get rid of jenkins based deployment in favor of using trebuchet/git-deploy from a single host
[15:56:13] <^d> (what beta always aimed to be, but fell short)
[15:56:32] yeah, ideally we wouldn't want jenkins touching staging at all
[15:56:34] <^d> hashar: Whatever we do I don't think it'll involve jenkins :)
[15:56:36] ^d: that is mean :]
[15:57:02] <^d> beta's nice, it was a good test run. but it's too unlike prod to be used as a real canary
[15:58:22] <^d> I want to remove the jenkins dependency because it's just another weird cog in the wheel of beta vs. prod.
[15:59:11] <^d> YuviPanda: I added a 0 to the list.
[15:59:11] <_joe_> ^d: how is beta too different from prod?
[15:59:15] Yay no-SWAT mornings
[15:59:27] ^d: :)
[15:59:44] * _joe_ curious
[15:59:59] <^d> _joe_: beta-specific roles. nfs. lack of lvs. a fenari-esque bastion/deploy host
[16:01:24] <^d> In an ideal world, we could use staging as a test ground for puppet patches too that'll affect prod services
[16:01:35] <^d> (ie: non-misc stuff that won't be in staging)
[16:02:32] <_joe_> ok, at least 1 of these things is basically irrelevant (lvs), nfs is easily removed from the equation...
[16:02:52] <^d> Also ideally, we want this super cleaned up so we could blow away and recreate staging...weekly or something....so we'd always have a clean staging environment.
[16:02:54] <_joe_> ^d: yeah but too many different use cases for beta right now
[16:03:22] <_joe_> and prod testing... I don't believe in it for opsy stuff
[16:03:48] <_joe_> you can do basic correctness testing (i.e.
this ain't gonna blow up because it's wrong)
[16:04:05] <_joe_> but well, at our scale we need canary testing instead
[16:04:11] <^d> I'm not saying that ops stuff would /have/ to go through staging, just that it'd give you a clean place to test something if you're worried
[16:04:23] <_joe_> nod
[16:04:36] well we tested the text varnish on beta cluster and found some bugs. So it definitely has potential to help ops working on new features
[16:05:07] though for most tiny changes, it is easier to unpool a prod box and test it directly
[16:05:08] <_joe_> hashar: we got syntax errors
[16:05:12] <_joe_> not functional bugs
[16:05:22] <^d> Also, names matter. There's this meme that "beta is broken"
[16:05:29] <^d> So we want to move beyond beta and make a "staging"
[16:05:33] <_joe_> those are only noticed when 100K request hit you any second :)
[16:05:49] <_joe_> ^d: eheh call it "preprod"
[16:05:54] <_joe_> sounds so solid
[16:06:10] <^d> betterthanbeta
[16:06:25] <_joe_> betawithkittens
[16:06:52] school time bbl
[16:06:53] <^d> YuviPanda: That's what we're missing. Kittens.
[16:07:54] and lollipops. I want more lollipops
[16:07:59] hmm
[16:08:09] BeCaaS
[16:08:15] Beta Cluster as a Service
[16:08:24] strike beta there
[16:08:27] CaaS
[16:08:37] <_joe_> it sounds bad in italian
[16:08:43] everything does
[16:08:56] <^d> *aaS doesn't sound good period
[16:08:56] * greg-g hasn't had his coffee yet
[16:09:10] <_joe_> greg-g: sounds suspiciously similar to http://en.wiktionary.org/wiki/cazzo
[16:09:18] wait, this is -operations, not the private channel?
[16:09:24] <_joe_> ahahah [16:09:26] <_joe_> nope [16:09:37] <_joe_> all your jokes are no publicly logged [16:09:40] <_joe_> ;) [16:09:41] :) [16:09:46] <_joe_> *now [16:10:23] <_joe_> whenever someone will search for "greg-g" you will appear in this embarassing conversation with a bunch of hippie engineers [16:10:47] ^d: I just spent about 10mins discussing with my current roommate if I should file a project request myself and then create the project myself, or if I shouldn't bother with it. [16:10:52] <^d> I'm not old enough to be a hippie [16:10:58] * YuviPanda feels terrible now [16:11:28] YuviPanda: in wmf phab? [16:11:31] <_joe_> ^d: are you implying that I am? [16:11:41] <_joe_> :P [16:11:42] * YuviPanda is partly hippie [16:11:47] or however you spell that [16:12:21] greg-g: yup, filed https://phabricator.wikimedia.org/T88439?workflow=76375 [16:12:24] <^d> _joe_: No, just that I can't be lumped in that "bunch of hippie engineers" :p [16:13:10] <^d> "With kittens, lollypops, british chocolate, and tea." [16:13:57] YuviPanda: they want everyone, even project creators, to log the request for tracking, I though [16:14:00] t [16:14:23] greg-g: no, this is creating a project in labs [16:14:28] though that would get unweildy for those doing sprints [16:14:30] ahhhh [16:15:06] ^d: step 0 done. [16:17:27] <^d> go team! [16:18:19] ^d: don't create instances yet [16:18:25] <^d> I wasn't [16:18:36] I've new images that get rid of /var [16:18:44] <^d> Instance can't exist without a phab task anyway [16:18:44] but want to confirm with andrew before I promote them [16:18:54] ^d: there's already a phab task for staging-palladium :D [16:19:02] <^d> I just made one for tin [16:19:06] ^d: wheee [16:19:51] ^d: do we need terbium? [16:19:55] ^d: in staging? [16:19:56] * YuviPanda isn't sure [16:20:00] <^d> Yes [16:20:04] <^d> Oh, hmm [16:20:24] <^d> Crons could run elsewhere probably. 
We don't need the people.wm.o stuff or one-off scripting [16:20:31] <^d> We can probably just combine tin/terbium in this case. [16:20:33] we have crons on terbium? [16:20:41] <^d> Yes like 10 or so [16:20:59] what kind of crons? [16:21:23] wellfuckit. VMs are cheap [16:21:30] <^d> See misc/maintenance.pp [16:21:45] file { '/usr/local/sbin/skrillex.py': [16:21:46] really? [16:21:48] <^d> Wow, way more than 10. I hadn't looked in awhile [16:22:14] :) [16:22:31] ^d: you know what else we should do? find some way to make node definitions stick. [16:22:52] <^d> Hmm? [16:22:53] ^d: so we don't have to futz around with wikitech [16:22:57] <^d> Ah, that [16:22:59] but instead can have a site.pp type thing [16:23:13] we've to use the stupid ec2id atm [16:23:17] but... that's perhaps ok? [16:26:03] jouncebot: next [16:26:24] * bd808 looks for a stick to whack jouncebot with [16:26:24] operations: Migrate racktables to servermon - https://phabricator.wikimedia.org/T88424#1011691 (mark) p:Normal>Low [16:29:53] <_joe_> YuviPanda: mmh let's talk about these ideas :) [16:30:01] jouncebot: next [16:30:01] In 1 hour(s) and 29 minute(s): Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150203T1800) [16:32:32] operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1011693 (akosiaris) @Gwicke, @ssastry, let's stall this for a while if you don't mind. As noted in T63882 the ability to add a timeout and hence fix the bug has been added in salt v2014.7. [16:32:59] _joe_: yeah :) tomorrow perhaps? I'm going to go off soon [16:33:10] well, I'll be around to make the new images available [16:33:13] and then disappear, probably [16:33:27] <_joe_> that was a polite way to say "site.pp again no FFS!" [16:35:09] <^d> _joe_: Well obviously not site.pp.
We'll make a betasite.pp :p [16:35:18] <_joe_> ahahah [16:35:25] * _joe_ hysterically laughs [16:39:11] _joe_: idk I would take site.Po over wikitech [16:39:12] .pp [16:51:49] operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1011716 (ssastry) Works for me. I'll continue to keep two shells open, one to bast1001 and another to tin and that should take care of it for deploys. [16:53:46] (PS1) BBlack: cp1064 -> jessie [puppet] - https://gerrit.wikimedia.org/r/188383 [16:54:12] (CR) BBlack: [C: 2 V: 2] cp1064 -> jessie [puppet] - https://gerrit.wikimedia.org/r/188383 (owner: BBlack) [16:55:12] !log cp1064 (eqiad upload cache) depooled in pybal [16:55:15] Logged the message, Master [16:59:57] (Draft2) Filippo Giunchedi: initial debian/ directory [debs/dropwizard-metrics] - https://gerrit.wikimedia.org/r/188384 (https://phabricator.wikimedia.org/T78514) [17:00:12] (Draft2) Filippo Giunchedi: import LICENSE and initial jar files [debs/dropwizard-metrics] - https://gerrit.wikimedia.org/r/188385 (https://phabricator.wikimedia.org/T78514) [17:00:22] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:00:25] (Draft2) Filippo Giunchedi: include gitreview [debs/dropwizard-metrics] - https://gerrit.wikimedia.org/r/188386 (https://phabricator.wikimedia.org/T78514) [17:01:01] akosiaris: I've added you to some code reviews for metrics-graphite, it isn't particularly pretty but should work [17:01:29] godog: yeah, I 've noticed [17:02:47] akosiaris: I've considered also archivia (sp?) but we'd need to ship them to the machine some way or another anyway, of course if you have a better idea I'm all for it [17:03:28] archiva you mean godog, right ? [17:03:42] archiva!
yes [17:03:56] archiva is for building the artifacts securely, not distributing it [17:05:32] although the https://wikitech.wikimedia.org/wiki/Archiva#Setting_up_git-fat_for_your_project guide [17:05:45] is perfect for confusing you on it :-) [17:05:57] anyway, I am fine with debs [17:07:13] bblack: I am searching and searching and I find no evidence :-(. Either someone has been messing up with my memory or I need and upgrade [17:07:22] !log restarted elasticsearch on logstash1002; OOM [17:07:29] Logged the message, Master [17:08:18] akosiaris: cool, thanks! [17:12:30] gwicke: is there a cassandra test setup in labs I can use? or failing that can I poke at cassandra on e.g. xenon? [17:13:02] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:17:22] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:17:32] (PS2) BryanDavis: logstash: Configure Elasticsearch index.merge.scheduler.max_thread_count [puppet] - https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) [17:17:39] (CR) jenkins-bot: [V: -1] logstash: Configure Elasticsearch index.merge.scheduler.max_thread_count [puppet] - https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: BryanDavis) [17:19:08] (PS3) BryanDavis: logstash: Configure Elasticsearch index.merge.scheduler.max_thread_count [puppet] - https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) [17:19:50] (CR) BryanDavis: "Patch set 2 switched default to 3 as suggested by Nik; Patch set 3 was a manual rebase."
[puppet] - https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: BryanDavis) [17:22:18] Staging, Beta-Cluster, operations: Move scap puppet code into a module - https://phabricator.wikimedia.org/T87221#1011847 (mmodell) [17:23:44] !log repooled cp1064 [17:23:46] (CR) JanZerebecki: [C: 1] "Yes, please merge and deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/188300 (https://phabricator.wikimedia.org/T87440) (owner: Glaisher) [17:23:48] Logged the message, Master [17:28:09] (PS2) Reedy: Add interwiki-labs.cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/175755 [17:29:23] (PS1) Reedy: Don't commit interwiki.cdb anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) [17:30:40] (PS1) RobH: misc-web-lb changes to support servermon.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/188389 [17:32:05] operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1011880 (RobH) https://gerrit.wikimedia.org/r/#/c/188389/ should be the relevant changes for misc-web-lb for this. (As long as servermon is using normal HTTP port only.) [17:33:21] (PS3) Reedy: Add interwiki-labs.cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/175755 [17:34:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:48:17] (CR) Alexandros Kosiaris: [C: 2] "This seems fine and is a noop, shepherding into production" [puppet] - https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: BryanDavis) [17:52:15] (CR) Alexandros Kosiaris: [C: 2] fix another 28 puppet linter warnings [puppet] - https://gerrit.wikimedia.org/r/188206 (https://phabricator.wikimedia.org/T87132) (owner: Dzahn) [17:54:27] YuviPanda: you can’t go bigger than 20g because the standard ‘flavor’ only allocates 20G for the instance to fit in.
[17:54:54] So we’d have to change the flavor config for anything bigger… it’s already the case that we can’t build ‘tiny’ instances this way. [18:00:04] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150203T1800). [18:01:12] (CR) Andrew Bogott: [C: 2] Fix linter paths discovery [puppet] - https://gerrit.wikimedia.org/r/188373 (owner: Hashar) [18:03:51] (PS1) Legoktm: composer.json: Set classmap-authoritative: true [mediawiki-config] - https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) [18:03:58] (CR) jenkins-bot: [V: -1] composer.json: Set classmap-authoritative: true [mediawiki-config] - https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: Legoktm) [18:04:18] andrewbogott: hmm, can we change the flavors? :) [18:04:26] andrewbogott: how are we on disk space? [18:04:55] grr [18:05:14] I’d rather not rearrange the flavors. There are plenty of uses where 20g is plenty [18:05:23] legoktm: ? [18:05:34] andrewbogott: hmm, ok [18:05:35] andrewbogott: complaining at jenkins :P [18:05:43] (PS1) Ejegg: Special:RecordImpression now sampled client-side [puppet] - https://gerrit.wikimedia.org/r/188395 (https://phabricator.wikimedia.org/T45250) [18:05:49] andrewbogott: let me rename the image to make it the standard one now [18:05:50] YuviPanda: the good news is that we don’t actually consume the disk space when the the instance is allocated, only when the disk space is filled by the VM [18:05:54] tested both trusty and precise ones, works fine [18:05:59] right [18:06:06] YuviPanda: did you try making a bigger instance and applying an lvm class? [18:06:12] andrewbogott: yup! [18:06:13] worked fine [18:06:15] ok then :) [18:06:17] :D [18:06:25] You see how to mark the old image as obsolete, right?
[18:07:28] operations, ops-requests, Continuous-Integration: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1012000 (Andrew) If this is going to run on a proper Jenkins box then we probably need to build a proper .deb package rather than inst... [18:07:38] (PS1) RobH: replacing old gerrit sha1 cert with new globalsign sha256 [puppet] - https://gerrit.wikimedia.org/r/188396 [18:08:05] (CR) RobH: "This shouldn't merge until we can be certain it won't cause a large interruption in gerrit service." [puppet] - https://gerrit.wikimedia.org/r/188396 (owner: RobH) [18:08:21] using gerrit to update gerrit, woooooo [18:08:22] (CR) Andrew Bogott: [C: -2] "Please build a .deb rather than using Bundler. I can help with this, but Filippo is the expert." [puppet] - https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T88430) (owner: Hashar) [18:08:54] ^demon|away: So when you are about, I plan to replace the gerrit cert and would love someone to review what im doing ;D (like we dont use gerrit.wikimedia.org cert for anything but https on that box that i can see, but i'd like a second set of eyeballs) [18:09:42] robh: so, if you break Gerrit, we can't fix Gerrit ;) [18:09:45] andrewbogott: yeah, just did that [18:10:01] JohnLewis: not in a proper non hotfix fashion, yep ;D [18:10:08] :D [18:10:08] andrewbogott: and yeah, new images in place \o/ [18:10:10] this is why ops is fun. [18:10:16] YuviPanda: cool, thanks [18:10:40] also, its so much easier to track all this shit now that we've adopted a use phab for everything stance. [18:10:40] andrewbogott: :D nice docs!
[18:10:43] fairly complete [18:10:47] ops is the love of living life on the edge knowing you can break the thing you use to deploy stuff, including the fix :p [18:10:49] (CR) Alexandros Kosiaris: [C: 2] add network variables for dumps rsync clients [puppet] - https://gerrit.wikimedia.org/r/188188 (owner: Dzahn) [18:11:37] Beta-Cluster, operations: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1012020 (yuvipanda) [18:12:18] YuviPanda: sweet, labs has a new partioning scheme? [18:12:21] YuviPanda: next (if you’re not too busy, hah!) you can figure out the parted mystery on jessie [18:12:38] andrewbogott: :D I'm probably going to get dragged off to sleep now. [18:12:49] Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#1012022 (RobH) cert has been purchased, the patchset to merge for this is: https://gerrit.wikimedia.org/r/#/c/188396/ When this happens, someone has to rm the chain... [18:13:09] YuviPanda: fair [18:13:20] Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#1012023 (RobH) [18:14:36] robh: lovely like space at the beginning of the cert :p [18:14:52] aww fuck [18:14:59] fixing ;D [18:15:12] robh: hey! I haven't responded to the ticket about releng / ops yet, but will do tomorrow. [18:15:23] * YuviPanda has spent the day mostly scheming with ^demon|away and learning / making new labs images [18:15:28] YuviPanda: ? [18:15:29] (PS2) RobH: replacing old gerrit sha1 cert with new globalsign sha256 [puppet] - https://gerrit.wikimedia.org/r/188396 [18:15:32] oh, the patchset thing?
[18:15:35] yeah [18:15:48] just wanted to say it's very much on my radar, even though I didn't get to it today :) [18:15:50] i had a moment of confusion cuz i also just did a bunch of releng access requests [18:15:53] (PS2) Yuvipanda: snapshot: Remove unused files from lucid era [puppet] - https://gerrit.wikimedia.org/r/188376 [18:15:54] had to task switch in brain ;D [18:16:08] (CR) Yuvipanda: [C: 2] snapshot: Remove unused files from lucid era [puppet] - https://gerrit.wikimedia.org/r/188376 (owner: Yuvipanda) [18:16:14] i went into phab clinic mode man, i didnt even recall the patch for review stuff yet today so its alllll good [18:16:18] * YuviPanda hyperthreads robh [18:16:20] heh :) [18:16:20] ok [18:16:34] you can enable hyperthreading, but that just splits the available memory to less per core [18:16:37] i forget enough as it is [18:16:53] (CR) John F. Lewis: [C: 1] "The patch itself looks sane to me." [puppet] - https://gerrit.wikimedia.org/r/188396 (owner: RobH) [18:17:15] :) [18:17:40] !log set up new images for ubuntu trusty / precise on labs, for https://phabricator.wikimedia.org/T87003 [18:17:46] Logged the message, Master [18:18:44] andrewbogott: me and _joe_ were wondering if we should just get rid of the swap partition on labs instances [18:18:49] YuviPanda: I'm trying to think now if I can break that to annoy you ;) [18:18:58] JohnLewis: break what? labs? :) [18:19:08] if I have to :D [18:19:13] :) [18:19:20] YuviPanda: I don’t think I have an opinion; I defer to _joe_’s judgement. [18:20:35] andrewbogott: cool, I have filed https://phabricator.wikimedia.org/T88450 and cc'd _joe_ [18:20:40] ok [18:21:19] (CR) RobH: "please note that i simply reused the existing gerrit.wikimedia.org.key file to generate the csr/certificate." [puppet] - https://gerrit.wikimedia.org/r/188396 (owner: RobH) [18:23:25] robh: for ticket duty I’m watching the ‘operations’ and ‘ops-requests’ queues but they are super quiet.
Am I missing something? Where are all the service-alert emails from vendors going? [18:23:37] oh you took over? [18:23:40] cool... uhh [18:23:56] look at my dash... https://phabricator.wikimedia.org/dashboard/view/47/ [18:24:16] robh: I think so? If you’ve been doing it too, that explans why there were so few tasks :) [18:24:27] there is no more ops-requests [18:24:47] we should make a task to update the eamil vendors use for outage notices [18:24:50] yea mark asked me to do it until someone else could volunteer, but i also planned to keep helping out [18:25:01] I mean we redirect and capture it sure, but probably best not to have them trying to hit RT forever [18:25:02] chasemp: yea are we ready to move that to phab? [18:25:18] they are sending to maint announce, i dont think its direct to rt is it? [18:25:25] i thought it was an alias used [18:25:30] yes sorry that was a "when it comes" thought [18:26:08] I think we can move it into phab whenever we are certain it'll work, but we have the issue with nda [18:26:14] we likely cannot make all maint announce public [18:26:26] can we? [18:26:28] robh: is it easy to make a standard ‘clinic duty’ panel that’s not attached to your personal panel? [18:26:47] I’d say they shouldn’t be public, some of them reveal points of vulnerability [18:26:48] andrewbogott: there is one that _joe_ started making but i need to use the admin panel to steal it [18:26:50] and make it ops editable [18:26:54] so yea [18:27:04] we cannot move maint in until the email bridge thing for security works [18:27:05] ok, I’ll use yours for now and leave this to the experts :) [18:27:11] chasemp: ^ similar to vendor stuff right? [18:27:19] Ah, so I should still be watching rt too? 
[18:27:40] robh: the 'ops clinic duty' widget - might be worth putting on dash 45 [18:27:49] The only things still in RT are procurement (whcih is mostly just myself/mark/faidon and doesnt need clinic duty review) and maint-announce [18:27:54] robh: sort of we can probably do something yeah [18:27:55] which clinic still needs to triage and update gcal [18:28:07] ok, that answers /that/ question. thanks [18:28:15] they are changing this upstream a bit now [18:28:17] so I'm waiting to see [18:28:20] JohnLewis: yea, i need to steal the permissions for 45 and update it [18:28:33] instead of one global email processing thing incoming, each app is going to have their own email pipeline [18:28:35] cool, so we should wait to move maint annoucne once the upstream changes [18:28:43] so you can make diffs by sending to diff@ as well as tasks with task@ [18:28:49] and assume pastes etc on down teh line [18:29:20] and they want you to use herald triggers based on source or content to attach projects [18:29:23] instead of doing #project [18:29:25] in the email [18:29:53] at least thus far I'm half expecting some association to persist but so far no one has a solid use case I think that can't be acheived otherwise [18:30:17] this kind of all stems from discarding #project auto association, and then them fleshing out email interaction [18:30:21] i only understood half of that ;D [18:30:22] anyways, more than you wanted to know [18:30:45] when you put #operations in a comment or task description it associates the object [18:30:45] so based on source would work [18:30:45] right [18:30:54] hearld sort on source for maint would totally work [18:30:55] that is going away (in theory) [18:31:01] for vendors less so [18:31:02] robh: yeah I think so [18:31:06] (PS1) Alexandros Kosiaris: Revert "add network variables for dumps rsync clients" [puppet] - https://gerrit.wikimedia.org/r/188404 [18:31:10] but yea, that would work well for maint-announces [18:31:20] yeah but vendors aren't
creating tickets and doing #project anyhow [18:31:28] (ok, now i totally get what you said, i just had to think about it a bit and recall my phab-fu) [18:31:30] indeed [18:31:34] they are always emailing direct to a task. [18:31:49] it scares me that phab makes sense to me ;_; [18:32:01] i think that means ive been assimilated. [18:32:03] dark side man [18:32:04] dark side [18:33:04] _joe_: do you mind if i use the admin account to steal ownership/edit of the ops dashboard? [18:33:17] <_joe_> robh: be my guest [18:33:19] right now its set to you only for edit, and I'd like to update it to reflect new workflows [18:33:20] cool [18:33:26] did you know there is https://wmf.zendesk.com/anonymous_requests/new [18:33:34] not just techsupport@ by mail [18:33:48] nope, because i avoid asking oit for things ;D [18:33:58] and it appears if you do, it creates gong sounds [18:34:13] via https://github.com/WikimediaOIT/zendesk-check [18:35:05] so last word on this for now is basically https://secure.phabricator.com/T6819#93958 [18:35:07] (CR) Alexandros Kosiaris: "I had to revert this in Ib90139aaa168a90f81149c7d9757966a67a0c4df as it is not really working and I have no time to actively fix right now" [puppet] - https://gerrit.wikimedia.org/r/188188 (owner: Dzahn) [18:35:12] (CR) Alexandros Kosiaris: [C: 2] Revert "add network variables for dumps rsync clients" [puppet] - https://gerrit.wikimedia.org/r/188404 (owner: Alexandros Kosiaris) [18:38:19] akosiaris: oh? what was the error message [18:38:38] operations, Wikimedia-OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1012089 (Jgreen) Looks to me like we already have SessionCSRFProtection and SessionCSRFProtection enabled; both are set to 1 in Kernel/Config/Files/ZZZAAuto.pm and I confirmed that my browser has a co...
[18:38:51] reads commit message of revert [18:39:30] mutante: it will not be used is the problem and I wanted to avoiding a wrong impression [18:39:40] to avoid* [18:40:06] the variables in there will not be evaluated to become ferm macros [18:40:27] cause only those in defs.erb are evaluated, not arbitrary variables in network.pp [18:40:34] mutante: ^ [18:41:00] gtg [18:41:13] akosiaris: thanks, yea, that makes sense now! [18:41:27] cya later, i'll look at an alternative [18:41:54] it would be nice to use them in rsyncd config "allowed hosts" too [18:42:01] where i took them from [18:42:28] that should be possible [18:42:39] (CR) Manybubbles: [C: 2 V: 2] Update wikimedia extra plugin [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/180204 (owner: Manybubbles) [19:00:04] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150203T1900). [19:07:49] Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1012240 (Dzahn) [19:08:07] operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1012241 (Dzahn) [19:08:20] Analytics, MediaWiki-General-or-Unknown, operations, Services, Wikidata, wikidata-query-service: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1012242 (GWicke) [19:14:11] operations: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#1012286 (Dzahn) [19:17:39] (CR) John F.
Lewis: [C: 1] add ferm service for rsyncd to dumps role [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn) [19:19:50] andrewbogott: ok, now you can use that ops dash, i send email to ops about it [19:19:56] gwicke: I’m trying to handle the access requests for @Jdouglas and @mobrovac but I don’t think I have usernames or keys for them. Can they amend https://phabricator.wikimedia.org/T85492 accordingly? [19:19:59] robh: cool [19:20:05] and the clinic page is updated to reflect whats now in RT and whats not [19:20:24] andrewbogott: if they have no user, they shoudl be split to subtasks [19:20:33] grouped access requests are ONLY for escalating existing user access [19:20:41] new users have to have independent requests and acknowledgements =] [19:20:56] info on https://wikitech.wikimedia.org/wiki/Requesting_shell_access [19:20:57] robh: they /might/ have usernames but if the request doesn’t include the shell name then I have no way of knowing [19:21:11] yep, just heading off potential answer =] [19:21:28] also, its a new enforcement on our side from our phab meetings [19:21:34] i felt the need to stress it [19:21:44] robh: yep, good to know [19:22:19] so yea i totally robbed the low hanging fruit of clinic duty yesterday, sorry ;D [19:22:41] you know, all the easy ones that make ops look awesome, heh [19:23:38] RESTBase, Services, Ops-Access-Requests: Shell access for @Jdouglas - https://phabricator.wikimedia.org/T88464#1012296 (Andrew) NEW a:Jdouglas [19:24:44] RESTBase, Services, Ops-Access-Requests: Shell access for @mobrovac - https://phabricator.wikimedia.org/T88465#1012303 (Andrew) NEW a:mobrovac [19:24:49] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[19:26:17] RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1012318 (Andrew) [19:27:50] (PS1) Reedy: Non wikipedias to 1.25wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/188409 [19:27:55] (PS1) Andrew Bogott: Give Cassandra access to smalyshev. [puppet] - https://gerrit.wikimedia.org/r/188410 [19:28:37] (Abandoned) John F. Lewis: Change wgMaxRedirects for enwiki and dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/154720 (https://bugzilla.wikimedia.org/63388) (owner: John F. Lewis) [19:28:54] (Abandoned) John F. Lewis: mailman: enable rmlist for web-list deletion [puppet] - https://gerrit.wikimedia.org/r/170398 (owner: John F. Lewis) [19:29:49] (CR) Reedy: [C: 2] Non wikipedias to 1.25wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/188409 (owner: Reedy) [19:29:54] (Merged) jenkins-bot: Non wikipedias to 1.25wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/188409 (owner: Reedy) [19:30:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:30:22] (CR) Andrew Bogott: [C: 2] Give Cassandra access to smalyshev. [puppet] - https://gerrit.wikimedia.org/r/188410 (owner: Andrew Bogott) [19:30:26] RESTBase, Services, Ops-Access-Requests: Shell access for @mobrovac - https://phabricator.wikimedia.org/T88465#1012354 (mobrovac) a:mobrovac>Andrew mobrovac ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAr04C1g3grQJRwhOyrAic/xzW+2lxlxwjdTIY12HJs6aBvKTeUhwMLTxSMQ0nsFacnCcdTU1YcDYn0ypXxpd/v62uX4nbnw3goYSgKysmYlrHi...
[19:30:45] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf15 [19:30:48] Logged the message, Master [19:33:19] operations: Document Debian/Ubuntu security update procedure & command - https://phabricator.wikimedia.org/T88469#1012356 (Gage) NEW [19:33:51] operations: Document Debian/Ubuntu security update procedure & command - https://phabricator.wikimedia.org/T88469#1012364 (Gage) [19:34:07] robh: anyone look at https://phabricator.wikimedia.org/T86808 (for hoo) since the summit? [19:34:17] it came up [19:34:20] but it had some oddness to it [19:34:25] just friendly reminder :) [19:34:26] ok [19:34:34] i dont recall what the outcome is other than someone knows about it [19:34:46] k [19:34:55] andrewbogott: do you recall that during ops meetings? [19:35:06] if not its about to totally become your clinic duty issue ;D [19:35:16] :) [19:35:33] robh: Yes, I think that we discussed that we’re hoping Ariel will recover and handle it [19:35:57] And if that doesn’t happen in a few more days we’ll intervene [19:36:23] um… at least, I think that was the discussion. mutante, that sound right? re: https://phabricator.wikimedia.org/T86808 [19:38:52] (CR) Nemo bis: [C: 1] "Looks sane, but can't verify" [debs/kafka] - https://gerrit.wikimedia.org/r/187648 (owner: Mattrobenolt) [19:39:05] if you are asking me , we came full circle :) [19:39:07] but yea [19:40:38] (CR) Dzahn: [C: 1] Allow "hoo" to sudo into datasets [puppet] - https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: Hoo man) [19:40:46] but that was my own PS.. so ... [19:47:20] (PS4) Andrew Bogott: base: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170477 (owner: John F. Lewis) [19:49:58] (CR) Andrew Bogott: [C: 2] base: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170477 (owner: John F. Lewis) [19:50:27] (PS1) John F.
Lewis: add ferm rules for ganglia_new::web [puppet] - https://gerrit.wikimedia.org/r/188415 [19:52:09] (CR) Plucas: [C: 1] "Looks good to me." [debs/kafka] - https://gerrit.wikimedia.org/r/187648 (owner: Mattrobenolt) [19:52:40] JohnLewis: https://phabricator.wikimedia.org/P256 [19:53:12] mutante: thanks [19:54:48] (CR) Dzahn: "https://phabricator.wikimedia.org/P256" [puppet] - https://gerrit.wikimedia.org/r/172434 (owner: John F. Lewis) [20:03:45] ops-codfw, hardware-requests, operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1012480 (Papaul) [20:03:48] ops-codfw, operations: reclaim rbf2002/WMF5833 back to spare, allocate WMF5845 as rbf2002 - https://phabricator.wikimedia.org/T88380#1012478 (Papaul) Open>Resolved Racktable updated mgmt setup complete and tested port = ge-5/0/19 complete. [20:05:09] Project-Creators, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#1012491 (Dzahn) [20:09:51] (PS1) John F. Lewis: Use noc@ for apache2 ServerAdmin [puppet] - https://gerrit.wikimedia.org/r/188416 [20:13:57] operations: retire "ops-requests" / delete tag - https://phabricator.wikimedia.org/T88461#1012547 (Dzahn) p:Triage>Low [20:14:26] (Abandoned) John F. Lewis: base: autoload modules [puppet] - https://gerrit.wikimedia.org/r/170481 (owner: John F. Lewis) [20:15:32] (CR) Dzahn: [C: 1] "that's right, noc@ it is" [puppet] - https://gerrit.wikimedia.org/r/188416 (owner: John F. Lewis) [20:18:37] (PS1) John F. Lewis: base: move ::grub to grub.pp [puppet] - https://gerrit.wikimedia.org/r/188417 [20:23:28] (PS1) John F. Lewis: base: move syslogs/remote-syslogs to manifests [puppet] - https://gerrit.wikimedia.org/r/188419 [20:25:39] (PS1) John F. Lewis: base: move instance-upstarts to manifest [puppet] - https://gerrit.wikimedia.org/r/188420 [20:27:09] (PS1) John F.
Lewis: base: move screenconfig to manifest [puppet] - https://gerrit.wikimedia.org/r/188421 [20:29:43] (PS1) John F. Lewis: base: move base::firewall to manifest [puppet] - https://gerrit.wikimedia.org/r/188423 [20:29:49] (CR) jenkins-bot: [V: -1] base: move base::firewall to manifest [puppet] - https://gerrit.wikimedia.org/r/188423 (owner: John F. Lewis) [20:37:42] (PS2) Andrew Bogott: base: move ::grub to grub.pp [puppet] - https://gerrit.wikimedia.org/r/188417 (owner: John F. Lewis) [20:38:42] Project-Creators, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#1012596 (Jgreen) > As for subteams: > - We agreed on the list to split the operations aspects of fundraising (i.e. mostly @jgreen's work) into a different project. I wa... [20:39:19] (CR) Andrew Bogott: [C: 2] base: move ::grub to grub.pp [puppet] - https://gerrit.wikimedia.org/r/188417 (owner: John F. Lewis) [20:40:45] (PS2) Andrew Bogott: base: move screenconfig to manifest [puppet] - https://gerrit.wikimedia.org/r/188421 (owner: John F. Lewis) [20:42:07] (CR) Andrew Bogott: [C: 2] base: move screenconfig to manifest [puppet] - https://gerrit.wikimedia.org/r/188421 (owner: John F. Lewis) [20:43:48] (CR) Andrew Bogott: [C: -1] "needs rebase" [puppet] - https://gerrit.wikimedia.org/r/188420 (owner: John F. Lewis) [20:44:13] (CR) Andrew Bogott: [C: -1] "needs rebase" [puppet] - https://gerrit.wikimedia.org/r/188419 (owner: John F. Lewis) [20:46:23] (PS2) Andrew Bogott: add ferm rules for ganglia_new::web [puppet] - https://gerrit.wikimedia.org/r/188415 (owner: John F.
Lewis)
[20:46:56] ops-codfw, operations: setup/deploy server procyon - corporate oit backup server - https://phabricator.wikimedia.org/T87028 (RobH) gateway = 208.80.152.241
[20:48:45] RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1012617 (Andrew) Access granted to @Smalyshev. The other two are blocked on subtasks -- I need keys and usernames (or pointers to them if they're alrea...
[20:50:56] RESTBase, Services, Ops-Access-Requests: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1012620 (mobrovac) >>! In T85492#1012617, @Andrew wrote: > I need keys and usernames (or pointers to them if they're already up someplace.) @Andrew I p...
[20:53:58] (PS1) BBlack: kill jessie comment [puppet] - https://gerrit.wikimedia.org/r/188426
[20:54:00] (PS1) BBlack: cp10[67]0 -> jessie [puppet] - https://gerrit.wikimedia.org/r/188427
[20:54:02] (PS1) BBlack: depool cp10[67]0 cache backends [puppet] - https://gerrit.wikimedia.org/r/188428
[20:54:18] (CR) BBlack: [C: 2 V: 2] kill jessie comment [puppet] - https://gerrit.wikimedia.org/r/188426 (owner: BBlack)
[20:54:36] (CR) BBlack: [C: 2 V: 2] cp10[67]0 -> jessie [puppet] - https://gerrit.wikimedia.org/r/188427 (owner: BBlack)
[20:54:53] (CR) BBlack: [C: 2 V: 2] depool cp10[67]0 cache backends [puppet] - https://gerrit.wikimedia.org/r/188428 (owner: BBlack)
[20:55:51] <^demon|away> robh: Yo, I'm about for a tiny bit. You still need eyes on gerrit cert?
[20:56:06] just wondering if i need to schedule the flip or what
[20:56:14] cuz if it looks good to you we can push it now =]
[20:56:17] <^demon|away> It should just be an apache restart, right?
[20:56:25] well, delete chained cert, run puppet
[20:56:27] apache restart
[20:56:33] that should do it.
[20:56:39] <^demon|away> Yeah I'd just go ahead.
[20:56:41] i just prefered if you were around in case horrible hsit happened
[20:56:44] cool, i'll do it now
[20:56:46] <^demon|away> Gerrit service won't need restarting
[20:56:53] <^demon|away> So less impact.
[20:57:01] <^demon|away> Just !log and move on :)
[20:57:02] https://gerrit.wikimedia.org/r/#/c/188396/ is the change
[20:57:16] if you wanted to +1 it so it doest look liek i did alone ;D
[20:57:42] (CR) Chad: [C: 1] "I'm pretending to understand SSL certs!" [puppet] - https://gerrit.wikimedia.org/r/188396 (owner: RobH)
[20:57:45] hehe
[20:57:48] nice
[20:57:56] (CR) RobH: [C: 2] replacing old gerrit sha1 cert with new globalsign sha256 [puppet] - https://gerrit.wikimedia.org/r/188396 (owner: RobH)
[20:58:57] !log updating gerrit.wikimedia.org cert
[20:59:05] Logged the message, Master
[20:59:48] well, that went well
[20:59:51] !log depooling cp1060, cp1070 (1 each bits + mobile) for reinstall
[20:59:52] ^demon|away: new cert is live, wooo
[20:59:55] Logged the message, Master
[21:00:25] <^demon|away> Expires May 25, 2018. Lgtm!
[21:00:27] <^demon|away> :)
[21:00:50] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1012640 (RobH)
[21:00:55] operations, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#1012638 (RobH) Open>Resolved gerrit.wikimedia.org now has a new globalsign cert that is sha256, resolving.
[21:01:59] i was about to ask about the .crt vs.
pem renaming of the CA cert
[21:02:07] then saw it's a symlink from one to another
[21:02:10] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[21:02:32] pem is system generated
[21:02:38] we were doing it wrong a very long time
[21:03:07] (but you know that ;)
[21:03:10] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100%
[21:05:11] RECOVERY - Host cp1060 is UP: PING OK - Packet loss = 0%, RTA = 3.45 ms
[21:05:39] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms
[21:07:02] (CR) Hashar: "@andrewbogott Gently ping :-D Would be nice to have labs instance to hit our apt mirror instead of Ubuntu ones. Should save a bunch of ba" [puppet] - https://gerrit.wikimedia.org/r/174971 (owner: Andrew Bogott)
[21:07:20] PROBLEM - Varnishkafka log producer on cp1060 is CRITICAL: Connection refused by host
[21:07:20] PROBLEM - Varnish traffic logger on cp1060 is CRITICAL: Connection refused by host
[21:07:29] PROBLEM - puppet last run on cp1060 is CRITICAL: Connection refused by host
[21:07:29] PROBLEM - Disk space on cp1060 is CRITICAL: Connection refused by host
[21:07:30] PROBLEM - DPKG on cp1060 is CRITICAL: Connection refused by host
[21:07:30] PROBLEM - dhclient process on cp1060 is CRITICAL: Connection refused by host
[21:07:39] PROBLEM - HTTPS on cp1060 is CRITICAL: Return code of 255 is out of bounds
[21:07:40] PROBLEM - salt-minion processes on cp1060 is CRITICAL: Connection refused by host
[21:07:40] PROBLEM - RAID on cp1060 is CRITICAL: Connection refused by host
[21:07:41] ^ not to mention our apt repo is different, we have packages we prefer to upstream
[21:07:51] PROBLEM - Varnish HTTP bits on cp1070 is CRITICAL: Connection refused
[21:07:59] PROBLEM - Varnish HTTP mobile-frontend on cp1060 is CRITICAL: Connection refused
[21:08:00] PROBLEM - Varnish HTTP mobile-backend on cp1060 is CRITICAL: Connection refused
[21:08:00] PROBLEM - configured eth on cp1060 is CRITICAL: Connection refused by host
[21:08:10]
PROBLEM - DPKG on cp1070 is CRITICAL: Connection refused by host
[21:08:10] PROBLEM - RAID on cp1070 is CRITICAL: Connection refused by host
[21:08:10] PROBLEM - Varnish HTCP daemon on cp1060 is CRITICAL: Connection refused by host
[21:08:20] PROBLEM - puppet last run on cp1070 is CRITICAL: Connection refused by host
[21:08:20] PROBLEM - HTTPS on cp1070 is CRITICAL: Return code of 255 is out of bounds
[21:08:21] PROBLEM - configured eth on cp1070 is CRITICAL: Connection refused by host
[21:08:39] PROBLEM - dhclient process on cp1070 is CRITICAL: Connection refused by host
[21:08:40] PROBLEM - Disk space on cp1070 is CRITICAL: Connection refused by host
[21:08:49] PROBLEM - salt-minion processes on cp1070 is CRITICAL: Connection refused by host
[21:09:45] (CR) Dzahn: "+1 for using our own mirror instead" [puppet] - https://gerrit.wikimedia.org/r/174971 (owner: Andrew Bogott)
[21:12:06] (CR) Hashar: [C: -1] "It seems the role::beta::appserver class is no more applied on the beta cluster instances so should be safe." [puppet] - https://gerrit.wikimedia.org/r/185966 (https://phabricator.wikimedia.org/T87210) (owner: Yuvipanda)
[21:17:30] RECOVERY - configured eth on cp1060 is OK: NRPE: Unable to read output
[21:17:39] RECOVERY - Varnish HTCP daemon on cp1060 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd
[21:17:44] greg-g, are we ok to deploy the las-only https://gerrit.wikimedia.org/r/#/c/188300/ ?
[21:17:48] !log increased opendj lookthrough-limit to 12000 on both ldap hosts. We just hit lucky 5000 users and some queries stopped working.
[21:17:50] s/las/labs/
[21:17:51] RECOVERY - Varnishkafka log producer on cp1060 is OK: PROCS OK: 1 process with command name varnishkafka
[21:17:51] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[21:17:53] Logged the message, Master
[21:18:09] RECOVERY - Disk space on cp1060 is OK: DISK OK
[21:18:59] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100%
[21:19:29] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[21:21:15] RECOVERY - Host cp1060 is UP: PING OK - Packet loss = 0%, RTA = 2.97 ms
[21:21:16] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[21:21:26] RECOVERY - dhclient process on cp1060 is OK: PROCS OK: 0 processes with command name dhclient
[21:21:26] RECOVERY - DPKG on cp1060 is OK: All packages OK
[21:21:35] RECOVERY - salt-minion processes on cp1060 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:21:35] RECOVERY - RAID on cp1060 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[21:21:35] RECOVERY - Varnish HTTP mobile-frontend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 348 bytes in 0.025 second response time
[21:21:36] RECOVERY - HTTPS on cp1060 is OK: SSLXNN OK - 36 OK
[21:22:15] RECOVERY - Varnish traffic logger on cp1060 is OK: PROCS OK: 2 processes with command name varnishncsa
[21:22:25] RECOVERY - Varnish HTTP mobile-backend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.003 second response time
[21:22:40] (PS1) Dzahn: lower TTL of etherpad.wm.org [dns] - https://gerrit.wikimedia.org/r/188430
[21:24:07] (PS1) BBlack: Revert "depool cp10[67]0 cache backends" [puppet] - https://gerrit.wikimedia.org/r/188431
[21:24:09] (CR) Dzahn: [C: 2] lower TTL of etherpad.wm.org [dns] - https://gerrit.wikimedia.org/r/188430 (owner: Dzahn)
[21:28:02] going to / on git.wikimedia.org seems to 500 but the repos themselves seem fine
[21:28:20] Usually a temporary
cache error
[21:28:48] It's a lottery, root regularly gives 5xx
[21:29:16] git is just gitblit, the web part, use gerrit.wm to clone from
[21:29:31] yea, seems it's broken again..:p
[21:29:50] !log restarted gitblit
[21:29:56] Logged the message, Master
[21:29:57] <_joe_> gitblit is mostly broken all the time
[21:30:02] * aude would love permissions to restart it
[21:30:02] <_joe_> sigh
[21:30:10] who needs gitblit?
[21:30:14] github replication is working again :P
[21:30:18] jenkins does for wikidata :(
[21:30:35] Sounds quite wrong
[21:30:39] it's slightly easier to view a list of repos in gitblit though
[21:30:40] we want to change that but apparently not that easy
[21:31:04] aude: if you rely on gitblit for development, I hope you filed this as a blocker for diffusion migration
[21:31:04] (CR) BBlack: [C: 2] Revert "depool cp10[67]0 cache backends" [puppet] - https://gerrit.wikimedia.org/r/188431 (owner: BBlack)
[21:31:05] what's the good alternative to gitblit?
[21:31:21] phabricator! :p
[21:31:25] <^demon|away> Diffusion :p
[21:31:26] mutante: github, but that breaks sometimes and not open source
[21:31:33] <^demon|away> Once we're done importing repos.
[21:31:39] :)
[21:31:40] * ^demon|away goes away again
[21:31:40] (Abandoned) Hashar: Basic rspec setup [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[21:31:47] (Abandoned) Hashar: Refactor rake entry point for specs [puppet] - https://gerrit.wikimedia.org/r/180162 (owner: Hashar)
[21:31:54] (Abandoned) Hashar: Use RSpec::Core::Runner.run() in rakefile [puppet] - https://gerrit.wikimedia.org/r/180215 (owner: Hashar)
[21:31:54] redirect git. to diffusion URLS :)
[21:31:59] !log repooled cp10[67]0
[21:32:03] Logged the message, Master
[21:32:17] uhm..
it doesnt come back yet as usual
[21:32:18] We still need to redirect gitweb URLs as well
[21:33:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 516 bytes in 0.004 second response time
[21:33:29] icinga-wm: late to the party
[21:33:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59709 bytes in 1.899 second response time
[21:34:10] MC8: try again
[21:34:30] works now, thanks
[21:34:42] Datasets-General-or-Unknown, operations: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1012825 (akosiaris) Open>Resolved
[21:34:47] we could let icinga execute the restart... hmm
[21:35:22] or a local "watchdog" thing
[21:36:30] MaxSem: if no one answered you, yes
[21:36:38] thanks
[21:37:22] (CR) MaxSem: [C: 2] Fix wgMobileUrlTemplate for wikidatawiki on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/188300 (https://phabricator.wikimedia.org/T87440) (owner: Glaisher)
[21:37:30] (Merged) jenkins-bot: Fix wgMobileUrlTemplate for wikidatawiki on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/188300 (https://phabricator.wikimedia.org/T87440) (owner: Glaisher)
[21:38:27] !log maxsem Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/188300/ (duration: 00m 07s)
[21:38:32] Logged the message, Master
[21:38:50] _joe_: 2 questions for you on T86887
[21:39:06] _joe_: added one redis box to puppet already ..
[21:39:15] (without the role that is)
[21:39:22] but do we want Debian?
[21:40:07] <_joe_> mutante: great! I've been busy with other things tbh
[21:40:26] <_joe_> mutante: I have to think about that
[21:40:34] sure, any time (later)
[21:40:37] <_joe_> it's a risk
[21:40:53] _joe_: the second box has to be ordered anyways, it wasnt the right one
[21:40:57] <_joe_> in theory, no if we haven't migrated the primary datacentered
[21:41:03] <_joe_> mutante: ah!
[21:41:09] yea, i was just thinking, why dont we do Debian right away in codfw
[21:41:40] <_joe_> yes, it could be good, but the scope of the project will immediately become larger
[21:41:55] you'll also find some updates on T86898
[21:42:00] <_joe_> by a sizable amount
[21:42:06] * _joe_ nod
[21:43:43] operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1012874 (Joe) At the moment we don't plan to replicate between datacenters. Maybe just the sessions data, in case, but multi-dc replication with redis is not an easy task for sure.
[21:44:26] operations, Wikimedia-OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1012875 (csteipp) >>! In T87217#998789, @lfaraone wrote: > Oh, wait, I just looked at our OTRS installation to remind myself. The session ID is in the URL. `https://ticket.wikimedia.org/otrs/index.pl?...
[22:03:52] (PS1) Jgreen: apache-fast-test strip transaction metadata before comparing response size [puppet] - https://gerrit.wikimedia.org/r/188475
[22:08:08] (CR) Jgreen: "ref https://phabricator.wikimedia.org/T84736" [puppet] - https://gerrit.wikimedia.org/r/188475 (owner: Jgreen)
[22:13:20] jgage: did we open the ulsfo power supply box?
[22:13:24] (vendor is askin)
[22:13:25] !log reedy Synchronized php-1.25wmf15/extensions/WikimediaMaintenance: tmp script (duration: 00m 07s)
[22:13:27] i assume yes =]
[22:13:30] Logged the message, Master
[22:15:18] (PS1) Dzahn: monitoring service: parameter for event_handlers [puppet] - https://gerrit.wikimedia.org/r/188477
[22:15:53] (CR) jenkins-bot: [V: -1] monitoring service: parameter for event_handlers [puppet] - https://gerrit.wikimedia.org/r/188477 (owner: Dzahn)
[22:16:01] operations: replace wikitech-static.wikimedia.org sha1 with sha256 cert - https://phabricator.wikimedia.org/T88487#1012985 (RobH) NEW a:RobH
[22:16:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[22:18:28] operations: replace wikitech-static.wikimedia.org sha1 with sha256 cert - https://phabricator.wikimedia.org/T88487#1013000 (RobH) completed replacement
[22:19:31] (PS2) Dzahn: monitoring service: parameter for event_handlers [puppet] - https://gerrit.wikimedia.org/r/188477
[22:19:50] ^ whatever that spike was, it wasn't me
[22:20:57] (CR) Dzahn: "this is to let us execute commands when something in Icinga turns CRIT - for example.. restart gitblit automatically when it goes down" [puppet] - https://gerrit.wikimedia.org/r/188477 (owner: Dzahn)
[22:21:27] Reedy: aude ^ ..
[22:21:37] operations: replace wikitech-static.wikimedia.org sha1 with sha256 cert - https://phabricator.wikimedia.org/T88487#1013019 (RobH) Revocation of the sha1 version via the rapidssl certificate via their portal isn't working, so I submitted an email directly. I'll leave this task open until I have confirmation o...
[22:21:48] woo
[22:24:26] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013033 (RobH)
[22:26:09] operations: replace rt.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88489#1013039 (RobH) NEW a:RobH
[22:27:29] Continuous-Integration, operations: Migrate operations/puppet.git to use a recent version of puppet-lint from rubygems - https://phabricator.wikimedia.org/T88430#1013049 (faidon) Why would we do all that instead of just getting 1.1.0 from Debian jessie and installing it to our reprepro's backport section? It...
[22:27:40] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013052 (RobH)
[22:28:02] operations: replace etherpad.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88490#1013058 (RobH) NEW a:RobH
[22:28:51] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#753138 (RobH)
[22:29:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:33:28] robh: re. https://phabricator.wikimedia.org/T88490 - the plan is to move it behind misc (I say plan, patches exist just yeah :p)
[22:33:40] ok but its free for me to replace with sha256
[22:33:43] and they have errors now.
[22:33:48] so not sure why i wouldnt replace ;D
[22:33:58] effort though :P
[22:34:01] mutante ^^
[22:34:01] mutante: yay :)
[22:34:06] yes but right now it has errors
[22:34:16] and its easier to replace for free than try to migrate every single one of those
[22:34:17] heh
[22:41:00] (PS1) RobH: replacing etherpad.wikimedia.org sha1 with sha256 cert [puppet] - https://gerrit.wikimedia.org/r/188479
[22:42:23] (CR) John F.
Lewis: [C: 1] replacing etherpad.wikimedia.org sha1 with sha256 cert [puppet] - https://gerrit.wikimedia.org/r/188479 (owner: RobH)
[22:42:39] (CR) RobH: [C: 2] replacing etherpad.wikimedia.org sha1 with sha256 cert [puppet] - https://gerrit.wikimedia.org/r/188479 (owner: RobH)
[22:43:54] !log replacing etherpad sha1 with sha256 cert
[22:43:57] Logged the message, Master
[22:49:33] (PS1) Dzahn: let icinga auto restart gitblit when it goes down [puppet] - https://gerrit.wikimedia.org/r/188480
[22:49:35] (PS1) RobH: replacing rt.wikimedia.org sha1 with sha256 cert [puppet] - https://gerrit.wikimedia.org/r/188481
[22:50:14] (CR) jenkins-bot: [V: -1] let icinga auto restart gitblit when it goes down [puppet] - https://gerrit.wikimedia.org/r/188480 (owner: Dzahn)
[22:50:39] (CR) Reedy: "woo :)" [puppet] - https://gerrit.wikimedia.org/r/188480 (owner: Dzahn)
[22:50:41] (CR) RobH: [C: 2] replacing rt.wikimedia.org sha1 with sha256 cert [puppet] - https://gerrit.wikimedia.org/r/188481 (owner: RobH)
[22:52:44] !log magnesium apache reload for rt cert replacement
[22:52:47] Logged the message, Master
[22:53:07] (PS2) Dzahn: let icinga auto restart gitblit when it goes down [puppet] - https://gerrit.wikimedia.org/r/188480
[22:53:40] !log starting rolling restart of logstash elasticsearch cluster to pick up index.merge.scheduler.max_thread_count puppet change
[22:53:43] Logged the message, Master
[22:54:20] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013117 (RobH)
[22:54:21] operations: replace etherpad.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88490#1013115 (RobH) Open>Resolved replaced and sha1 revoked, resolving
[22:54:27] !log restarted elasticsearch on logstash1003
[22:54:30] Logged the message, Master
[22:54:37] (PS3) Dzahn: let icinga auto restart gitblit when it goes down [puppet] -
https://gerrit.wikimedia.org/r/188480
[22:54:50] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#753138 (RobH)
[22:54:51] operations: replace rt.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88489#1013118 (RobH) Open>Resolved rt.wikimedia.org sha1 replaced with sha256, and sha1 revoked
[22:55:07] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 26 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 23, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 100, initializing_shards: 3, number_of_data_nodes: 3
[22:56:17] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 3, number_of_data_nodes: 3
[22:58:22] (CR) Giuseppe Lavagetto: [C: -2] "I always thought that using eventhandlers:" [puppet] - https://gerrit.wikimedia.org/r/188480 (owner: Dzahn)
[22:59:28] hmm, i just set the topic and that remove the -2? seems new
[23:00:26] (CR) Dzahn: "there's even more than just a ticket, there's an entire project :) https://phabricator.wikimedia.org/tag/gitblit-deprecate/" [puppet] - https://gerrit.wikimedia.org/r/188480 (owner: Dzahn)
[23:00:34] <_joe_> lol
[23:01:01] <_joe_> mutante: moreso, we should wait for it
[23:01:08] <_joe_> there is a plan to deprecate it
[23:03:08] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013134 (RobH)
[23:04:30] _joe_: hmm..
i think we will still restart it quite a few times until that happens
[23:04:34] echo "Gitblit has been restarted $(wget -O - https://wikitech.wikimedia.org/wiki/Server_Admin_Log 2> /dev/null| grep "restarted gitblit" | wc -l) times (plus when we did not log)"
[23:04:39] Gitblit has been restarted 7 times (plus when we did not log)
[23:05:18] <_joe_> many more times I guess
[23:05:32] operations: replace [blog|techblog].wikimedia.org sha1 certificates with sha256 - https://phabricator.wikimedia.org/T88491#1013136 (RobH) NEW a:RobH
[23:05:35] <_joe_> but still, it's part of the job
[23:05:54] <_joe_> we can't do that automatically
[23:06:12] <_joe_> we do have to see that it doesn't get in a restart-again cycle for instance
[23:06:15] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013147 (RobH)
[23:07:09] somehow i dont see us debugging the java app
[23:08:50] the security question.. well. yea, it needs nagios to ssh into the box, but it already has the user
[23:08:55] <_joe_> well at least babysut uts restart
[23:09:02] <_joe_> *its
[23:10:13] hmm, right, this woudln't work yet nagios has /bin/false shell
[23:10:18] on antimony
[23:10:59] sigh, it's when you think you can just add something nice really quick :p
[23:12:27] <_joe_> it was very well done btw :)
[23:12:40] thx
[23:16:41] (CR) Dzahn: "yes, this should be just fine and not influence anything, just a mini typo in there" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/188415 (owner: John F. Lewis)
[23:18:43] operations, Datasets-General-or-Unknown: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013178 (wpmirrordev) Resolved>Open This task has not been resolved. It is still not possible to access from an IPv6 network. This is due to lack of name resolutio...
[23:18:50] operations, Wikimedia-Git-or-Gerrit: Git.wikimedia.org is down - https://phabricator.wikimedia.org/T73974#1013181 (Dzahn) https://gerrit.wikimedia.org/r/#/c/188480/
[23:19:06] (PS3) John F. Lewis: add ferm rules for ganglia_new::web [puppet] - https://gerrit.wikimedia.org/r/188415
[23:19:12] operations, Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1013182 (Dzahn) Resolved>Open p:Unbreak!>Normal
[23:19:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[23:19:47] operations, Datasets-General-or-Unknown: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013186 (Dzahn) a:ArielGlenn>Dzahn
[23:20:16] (CR) Dzahn: "https://phabricator.wikimedia.org/T73974" [puppet] - https://gerrit.wikimedia.org/r/188480 (owner: Dzahn)
[23:20:34] (PS4) Dzahn: let icinga auto restart gitblit when it goes down [puppet] - https://gerrit.wikimedia.org/r/188480 (https://phabricator.wikimedia.org/T73974)
[23:20:55] operations, Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#755836 (Dzahn)
[23:21:04] mutante: I don't think Gerrit liked you posting that link :p
[23:21:33] JohnLewis: yea, last attempt to fix that had to be reverted
[23:21:44] bah
[23:21:51] operations, Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1013189 (Dzahn) echo "Gitblit has been restarted $(wget -O - https://wikitech.wikimedia.org/wiki/Server_Admin_Log 2> /dev/null| grep "restarted gitblit" | wc -l) times (plus when we did not log)"...
[23:22:27] JohnLewis: https://gerrit.wikimedia.org/r/#/c/177128/
[23:24:07] seems easier than it is-example :p
[23:25:28] Datasets-General-or-Unknown, operations: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013191 (Dzahn) ``` inet 208.80.154.11/26 brd 208.80.154.63 scope global eth2 inet6 2620:0:861:1:208:80:154:11/64 scope global ``` ^ so that part is good, we now have a "mapped" IPv6 add...
[23:26:49] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1013195 (RobH)
[23:28:09] (PS1) Dzahn: add AAAA record for dataset1001 [dns] - https://gerrit.wikimedia.org/r/188484 (https://phabricator.wikimedia.org/T68996)
[23:28:45] (CR) John F. Lewis: [C: 1] add AAAA record for dataset1001 [dns] - https://gerrit.wikimedia.org/r/188484 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[23:29:05] (CR) Dzahn: "DNS for this here: https://gerrit.wikimedia.org/r/#/c/188484/" [puppet] - https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[23:30:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[23:32:10] (PS2) Dzahn: add AAAA record for dataset1001 [dns] - https://gerrit.wikimedia.org/r/188484 (https://phabricator.wikimedia.org/T68996)
[23:32:23] JohnLewis: ..also reverse
[23:32:52] mutante: bah I keep forgetting the existence of PTRs
[23:33:31] looking at the PTR, +1 stands anyway
[23:33:40] JohnLewis: yea, i bet we have a couple to add in other places..
thanks
[23:34:23] more than likely
[23:34:41] (CR) Dzahn: [C: 2] add AAAA record for dataset1001 [dns] - https://gerrit.wikimedia.org/r/188484 (https://phabricator.wikimedia.org/T68996) (owner: Dzahn)
[23:38:48] Wikimedia-General-or-Unknown, operations: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#1013198 (Jalexander)
[23:39:51] Wikimedia-General-or-Unknown, operations: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#685159 (Jalexander) Given the privacy aspects of this I think we may want to switch the reply-to config for now as, at least, a stop gap measure. I've added o...
[23:40:42] is that still a problem? :/
[23:41:09] JohnLewis: now we have dataset1001 enabled, but dumps still not ..meh :p
[23:41:24] CNAME ...
[23:41:51] oh, wait, or do we
[23:42:56] nevermind, works :)
[23:43:25] Datasets-General-or-Unknown, operations: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013203 (Dzahn) >>! In T68996#1013178, @wpmirrordev wrote: > It is still not possible to access from an IPv6 network. > This is due to lack of name resolution. > ``` d...
[23:47:38] Datasets-General-or-Unknown, operations: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#1013209 (Dzahn) Open>Resolved
[23:47:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[23:49:02] (CR) Dzahn: [C: 1] Give parsoid admins the ability to update/restart the RT testing service.
[puppet] - https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: Cscott)
[23:56:48] operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1013234 (Dzahn) p:Triage>Normal
[23:56:53] operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1013236 (Dzahn) a:Dzahn