[00:49:02] (PS3) Chad: Gerrit: Swap DNS to new host, lead [dns] - https://gerrit.wikimedia.org/r/299007 [00:50:04] jynus, hashar: Welcome, party's about to start :) [00:50:09] mutante is here too! [00:50:16] yep [00:50:17] Perfect. All we need now is some balloons :p [00:50:31] * hashar waves [00:50:44] have to warm up a bit still [00:51:49] (PS2) Dzahn: Gerrit: Run list_reviewer_counts cron as root [puppet] - https://gerrit.wikimedia.org/r/300711 (owner: Chad) [00:52:37] harmless cron change as warmup i guess [00:53:25] Yeah Faidon noticed that was spamming on lead. I worked around it temporarily but it can totally run as root (now, it couldn't a few weeks ago) [00:53:37] Running as root will keep it from yelling at us again :) [00:53:43] (CR) Dzahn: [C: 2] Gerrit: Run list_reviewer_counts cron as root [puppet] - https://gerrit.wikimedia.org/r/300711 (owner: Chad) [00:55:01] !log starting hot backup of db1020's reviewdb [00:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:55:32] it will take some time to copy 60GB, it will be consistent with the end of the backup [00:56:03] Does hot mean it'll continue to tail the data as it's inserted? Don't want the last couple of changes to disappear accidentally :) [00:56:32] hot means it doesn't block writes [00:56:43] and those writes will be on the final backup [00:56:44] Ah ok [00:56:49] Makes sense. 
[00:57:21] !log manually deleted reviewer-counts cron from gerrit2 user, runs as root and puppet does not remove crons unless ensure=>absent [00:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:57:33] if there are writes between the end of the backup and the start of the process, they can be reapplied with the binary logs [00:58:07] whatever you do, do not start until it finishes or you ask to start [00:58:29] (CR) Dzahn: "i manually deleted the entry from gerrit2's crontab and it was added to root" [puppet] - https://gerrit.wikimedia.org/r/300711 (owner: Chad) [00:58:53] 27GB copied [00:59:21] basically the most important thing is to record the binlog position to revert in case it is needed [01:00:04] ostriches, mutante, and jynus: Respected human, time to deploy Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T0100). Please do the needful. [01:00:18] is it ok to delay it 5 minutes? [01:00:39] Yeah no worries [01:00:40] (you can start anything that does not involve the db [01:00:42] I wanna do things right :) [01:00:53] * hashar deploys 2.0-wmf.0 [01:01:03] I think I am awake now :D [01:01:09] * ostriches throws a stick at hashar [01:01:09] lol [01:02:04] !log rsyncing latest git data from ytterbium to lead [01:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:02:20] mutante, neon config has issues, related to openstack, do you know anything about that? 
[01:02:33] jynus: no, not yet [01:02:58] no worries for the weekend [01:03:01] checks the error [01:03:44] (PS1) Dzahn: remove ytterbium from puppet, update gerrit comment [puppet] - https://gerrit.wikimedia.org/r/300806 [01:03:52] Duplicate definition found for host 'californium' [01:03:55] yes [01:04:09] I saw that, but I didn't see anything obvious [01:04:12] yea, no idea yet, but i bet it's a reinstall [01:04:15] maybe hiera [01:04:49] the backup should be about to finish [01:05:04] when that happens, what I need is the following [01:05:16] (aside from the other stuff, of course) [01:05:53] ostriches, you tell me: I am ready to alter the db, I stop the slave, wait for confirmation, they you are good to go (it should only take 5 seconds or so) [01:06:04] californium exists with private IP and once with public IP in icinga config [01:06:04] s/they/then/ [01:06:10] since there is an open ticket "Move californium to an internal host?" [01:06:21] it seems likely that it's that [01:06:39] ok, let's focus on gerrit now [01:06:40] jynus: Sounds good. Should just be a few more minutes to finish copying data and getting the patches in place & rebased. [01:06:48] yes, no rush [01:07:04] (PS8) Chad: Gerrit: Swap lead to point at production data [puppet] - https://gerrit.wikimedia.org/r/298673 [01:08:11] 160725 01:05:16 innobackupex: completed OK! [01:08:37] so now waiting for you to stop the slave and record the binlog position [01:09:29] !log reviewdb backup finished, available on db1020:/srv/tmp/2016-07-25_00-54-31/ [01:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:10:04] !log stopping CI [01:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:10:12] no more jenkins-bot complains :D [01:11:48] Back in a second, restroom! 
[01:14:03] Operations, Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491056 (Dzahn) [01:14:15] Ok back. Rsync still going :) [01:14:29] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors daniel_zahn https://phabricator.wikimedia.org/T141232 [01:14:57] mutante: while at it you can ack zuul on gallium I have stopped it [01:15:12] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [01:15:12] ok [01:15:20] (just so everyone's on the same page, https://phabricator.wikimedia.org/T70271#2482308 is the task list we're following today) [01:15:32] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [01:15:39] mutante, jynus: californium, the horizon host? it hasn't been reinstalled recently AFAIK [01:15:40] mutante, I have added andrew, probably either he created it (by mistake) or can through more light into it if it is openstack-related [01:16:04] ACKNOWLEDGEMENT - zuul_gearman_service on gallium is CRITICAL: Connection refused daniel_zahn gerrit migration ongoing [01:16:04] ACKNOWLEDGEMENT - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server daniel_zahn gerrit migration ongoing [01:16:21] or maybe madhuv is working on that too, I do not know? [01:16:42] jynus: yes, perfect. *nod* [01:17:03] s/through/throw/ [01:17:23] I may be losing basic writing skills [01:17:29] Krenair: maybe the issue is that a name is being reused. 
we'll find out later [01:17:34] ok [01:17:35] * mutante opens the task list [01:17:37] !log scandium: migrating zuul-merger repos to lead find /srv/ssd/zuul/git -path '*/.git/config' -print -execdir sed -i -e 's/ytterbium.wikimedia.org/lead.wikimedia.org/' config \; [01:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:17:42] Anddddd done with the rsync. Ok we're all set. Everyone ready? [01:18:11] yes, I haven't considered a conflict e.g. (IPs), as mutante hinted on the ticket [01:18:38] anyway, not an emergecy for the weekend [01:18:44] The authenticity of host '[lead.wikimedia.org]:29418 ([208.80.154.82]:29418)' can't be established. [01:18:44] :( [01:19:14] Hmm, that's no bueno. [01:19:29] forgot ssh check for the IP address as well :d [01:20:08] so even if we kept the ssh host fingerprint for the ssh client on 29418, the ssh client still complains [01:20:14] * hashar raises fist at security [01:20:27] Can we clear the known_hosts? [01:20:43] I can accept it :] [01:20:53] (Also: lead hasn't had its IP change...) [01:21:44] are you migrating ytterbium IP address to lead? [01:21:54] Nope, new host, new IP. [01:22:05] no, just gerrit-new becomes gerrit [01:22:16] they have 2 IPs each [01:22:47] ahh [01:23:13] I thought that the gerrit.wikimedia IP ( the .81) would be moved [01:23:19] since that is a /32 service IP [01:24:23] https://gerrit.wikimedia.org/r/#/c/299007/3/templates/wikimedia.org [01:25:07] Hmm, should we amend that to keep the .81? [01:25:33] Seems tricky since ytterbium won't have disappeared yet. [01:25:44] what is the current problem? [01:25:49] this way it will be transport to people and robots [01:26:02] orhaze [01:26:05] my english is off [01:26:23] mutante: Existing known_hosts entries may complain since IP:hostname won't match with what they already have on file. [01:26:31] (even though the ssh_host_key is unchanged) [01:26:43] so you want to give ytterbium's current IP to lead? 
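A small illustration of the known_hosts concern raised above: an OpenSSH client (when CheckHostIP is enabled) looks up the offered host key under both the hostname and the IP address, so a service that keeps its key but moves to a new IP can still prompt the user. The sketch below is only a simplified model of that lookup. It uses an in-memory dict instead of a real known_hosts file, and the hostname, addresses, and key strings are made-up placeholders, not the production values.

```python
def check_host(known_hosts, hostname, ip, offered_key):
    """Model the per-name lookup: one status for the hostname, one for the IP."""
    results = {}
    for name in (hostname, ip):
        recorded = known_hosts.get(name)
        if recorded is None:
            results[name] = "unknown"    # client would ask to accept the key
        elif recorded == offered_key:
            results[name] = "ok"
        else:
            results[name] = "mismatch"   # client would warn loudly
    return results


# Hypothetical state: the key was previously seen under the old service IP.
known = {"gerrit.example.org": "KEY-A", "192.0.2.81": "KEY-A"}

# Same hostname and same key, but the service now answers on a new IP.
status = check_host(known, "gerrit.example.org", "192.0.2.85", "KEY-A")
print(status)
```

In this model the hostname entry still matches while the new IP has no entry at all, which corresponds to the "authenticity of host can't be established" prompt rather than a key-changed warning.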
[01:26:55] each have their own IP [01:27:01] gerrit being on a third one :) [01:29:02] your call really ostriches [01:29:25] sorry I should have thought about simply moving the IP when we talked about changing the DNS entry [01:29:32] Eh, let's go with what we're already planned. I don't want to swap things in-flight like this. [01:29:37] yeah [01:29:55] agree [01:30:30] In that case, let's move forward. mutante, let's merge those changes. [01:30:51] !log lead: stopped puppet for a few minutes [01:30:52] ok. at the same time ? [01:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:31:01] well. as much as possible [01:31:24] Yeah they can go in independently but we need both applied and live before we do the next step :) [01:31:49] And puppet's off on lead, so puppet won't try to be smart and do things to lead before we want :) [01:31:52] (CR) Dzahn: [C: 2] Gerrit: Swap lead to point at production data [puppet] - https://gerrit.wikimedia.org/r/298673 (owner: Chad) [01:32:10] and I have stopped CI so Jenkins is no more reporting [01:32:15] how about ytterbium [01:32:28] Ah yeah go ahead and merge the dns one now [01:32:35] Otherwise you'll get the maintenance page :p [01:32:52] stopped puppet on ytterbium too [01:32:56] merged the config change on master [01:33:05] Yeah I've got puppet stopped on both. [01:33:07] (CR) Dzahn: [C: 2] Gerrit: Swap DNS to new host, lead [dns] - https://gerrit.wikimedia.org/r/299007 (owner: Chad) [01:33:29] ok, ready for authdns-update? [01:33:44] Yep :D [01:34:02] !log switched gerrit-new to gerrit in DNS [01:34:06] Ok, I'm going to run puppet on lead now and send it into maintenance mode. [01:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:35:47] "Run puppet one last time on ytterbium, stop puppet, stop Gerrit processes" [01:35:54] the LE cert will be taken care of by puppet? [01:36:05] Yep. 
[01:36:12] nice [01:36:28] i disabled puppet on ytterbium, so the "one last time" has not happened [01:36:36] I turned it back on, ran it, then turned it off. [01:36:44] ok [01:36:48] !log ytterbium: Stopped puppet, stopped gerrit process. [01:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:37:13] ok, looks like we are at the snapshot stage [01:37:16] right [01:37:20] Yep. [01:37:33] jynus: We're ready to do the DB changes now. [01:37:41] !log scandium: restarted zuul-merger [01:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:38:43] !log m2 replication on db2011 stopped, master binlog pos: db1020-bin.000968:1013334195 [01:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:38:51] you are good to go, ostriches [01:38:59] Ok great. [01:39:13] !log lead: turning puppet back on, here we go [01:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:39:54] do we have a a way to flush the gerrit.wikimedia.org DNS entry on the dns recursors ? [01:40:08] or we can just wait :] [01:40:11] we lowered the TTL [01:40:17] yeah lets wait [01:40:25] the one I am seeing expires in 100 seconds [01:40:31] we could have lowered 5 more minutes.. oh well [01:41:10] what about step 9? [01:41:15] the new one has 1800 TTL apparently? [01:41:20] Schema upgades going. [01:41:22] PROBLEM - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:41:41] Can we get an ack on that? 
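The replication coordinates logged at 01:38:43 are a binlog file name and a byte offset joined by a colon. A minimal sketch of splitting such a string into the two values one would need to resume or replay from that point. The function and class names here are hypothetical helpers for illustration, not part of any MySQL tooling.

```python
from typing import NamedTuple


class BinlogPosition(NamedTuple):
    file: str    # binlog file name on the master
    offset: int  # byte offset within that file


def parse_binlog_position(text: str) -> BinlogPosition:
    """Split 'file:offset' on the last colon and validate the offset."""
    file, sep, offset = text.strip().rpartition(":")
    if not sep or not file or not offset.isdigit():
        raise ValueError(f"not a file:offset pair: {text!r}")
    return BinlogPosition(file, int(offset))


# The coordinates recorded in the !log entry above.
pos = parse_binlog_position("db1020-bin.000968:1013334195")
print(pos.file, pos.offset)
```

Keeping the offset as an integer makes it easy to compare two recorded positions within the same file.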
[01:41:56] on it [01:42:14] ACKNOWLEDGEMENT - SSH access on ytterbium is CRITICAL: Connection refused daniel_zahn gerrit migration T70271 [01:42:14] ACKNOWLEDGEMENT - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn gerrit migration T70271 [01:43:39] ACKNOWLEDGEMENT - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to verify gerrit-new.wikimedia.org against gerrit.wikimedia.org daniel_zahn gerrit migration T70271 [01:43:39] ACKNOWLEDGEMENT - SSH access on lead is CRITICAL: Connection refused daniel_zahn gerrit migration T70271 [01:43:39] ACKNOWLEDGEMENT - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn gerrit migration T70271 [01:44:55] Schema changes done. [01:46:58] if any puppet production hosts rely on git::clone() they might complain [01:47:16] [2016-07-25 01:46:54,539] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.12.2 ready [01:47:21] \o/ [01:48:12] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:48:19] Ok, now to bring us out of maintenance mode. Gonna have to live hack it so we can get a patch in :) [01:48:33] hashar: seems like the one kafka is one of them [01:48:50] mutante: most probably yeah [01:49:01] mutante: you might want to check it though [01:50:05] i did and there are no failures [01:50:08] Can we kick the grrt-wm? [01:50:10] bot [01:50:11] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:50:12] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [01:50:16] yes, handling bot [01:50:48] ostriches: may I reconnect Zuul? 
[01:50:53] Should be able to now yes [01:51:04] !log restarted grrrit-wm [01:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:52:45] (PS1) Chad: Gerrit: Maintenance is over yay! [puppet] - https://gerrit.wikimedia.org/r/300808 [01:53:00] :) [01:53:00] !log starting Zuul [01:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:53:33] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [01:53:52] mutante: Icinga ACK of an alarms is cleared out on recovery isn't it ? [01:53:53] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [01:53:58] My dashboard seems a little wonky, some abandoned/merged changes showing up, but I'm not too worried about that just this second. [01:54:00] since we ACKed we see nicely what comes back [01:54:08] which is good [01:54:14] better than just disabling notifications [01:54:23] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures [01:54:26] hashar: yes [01:54:31] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures [01:54:32] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures [01:54:57] these will be the ones that git clone [01:55:07] 2/3 i already know they are [01:56:11] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [01:56:23] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:56:32] when i ran puppet manually it was already ok again [01:58:09] ostriches: ssh on Precise auto add the new ip to known_hosts :) [01:59:26] Ah ok, that missing object in the log was something I missed in the rsync. Fixed now. [01:59:41] Woah, wtf? [02:00:13] I have a draft entry on my review list "Edit Project Config" for All-Projects. 
This probably happened while we were testing gerrit-new [02:00:15] ran puppet on the stat hosts too [02:00:32] Clicking on it gives you this: https://gerrit.wikimedia.org/r/#/c/298009/ [02:00:41] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:01:01] Krenair: How? None of the gerrit-new data was kept. [02:01:03] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/300808 (owner: Chad) [02:01:27] I also see "Testing gerrit-new (2.12) backport submodule updates" [02:02:11] And I don't see my more recent commits [02:02:14] Ohhhh [02:02:16] I know why [02:02:18] I have to reindex. [02:02:20] Duh [02:02:22] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [02:02:40] Should've started that already [02:03:27] !log gerrit: reindexing lucene now that we have new data. searches/dashboards may look a tad weird for a bit [02:03:29] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: Elasticsearch SSL on relforge hosts broken - https://phabricator.wikimedia.org/T141234#2491070 (Dzahn) [02:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:03:51] dont get distracted by that bug. 
it's been like that for over 3 days in icinga [02:03:58] but i want to ACK it [02:04:02] to make that clear [02:04:52] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:04:53] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: Elasticsearch SSL on relforge hosts broken - https://phabricator.wikimedia.org/T141234#2491083 (Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=relforge [02:05:14] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) daniel_zahn https://phabricator.wikimedia.org/T141234 [02:05:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS on relforge1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn https://phabricator.wikimedia.org/T141234 [02:05:14] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) daniel_zahn https://phabricator.wikimedia.org/T141234 [02:05:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS on relforge1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn https://phabricator.wikimedia.org/T141234 [02:05:18] zuul does not ssh to Gerrit properly for some reason [02:05:33] known_hosts? 
[02:06:49] ACKNOWLEDGEMENT - puppet last run on relforge1001 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T141234 [02:07:57] I'm going to take gerrit down for a minute, it's not being super cooperative with the reindex. [02:08:06] Uno moment folks :) [02:11:39] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: Elasticsearch SSL on relforge hosts broken - https://phabricator.wikimedia.org/T141234#2491085 (Dzahn) i think this is just ongoing setup like T141085 [02:11:44] Thereeeee it goes. [02:12:23] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch: Elasticsearch SSL on relforge hosts monitoring alerts - https://phabricator.wikimedia.org/T141234#2491087 (Dzahn) [02:13:25] Ok, I had to end up reindexing offline. Trying to force reindex was fighting with the online reindexer. [02:13:34] Fighting over lock files. [02:13:42] $ ssh -4 jenkins-bot@lead.wikimedia.org -p 29418 [02:13:42] ssh: connect to host lead.wikimedia.org port 29418: No route to host [02:13:48] what am I am screwing up ? :) [02:14:02] Well gerrit's off for a bit like I just said for one ;-) [02:14:21] oh [02:14:27] Connection refused on my end. [02:14:32] No route to host seems different though [02:14:35] no route to host is interesting though [02:14:36] yeah [02:14:36] Like it can't find lead. [02:14:39] what host is that from hashar? [02:14:48] is this from inside labs? [02:15:12] PROBLEM - SSH access on lead is CRITICAL: Connection refused [02:15:12] from gallium [02:15:13] sorry [02:15:24] mutante: Ack for a bit ^? 
[02:15:25] then if Gerrit is done that is logical [02:15:43] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [02:15:55] krenair@bastion-01:~$ ssh -4 lead.wikimedia.org -p 29418 [02:15:55] ssh: connect to host lead.wikimedia.org port 29418: Connection refused [02:15:55] krenair@bastion-01:~$ ssh -4 lead.wikimedia.org [02:15:55] ssh: connect to host lead.wikimedia.org port 22: No route to host [02:15:57] Interesting... [02:16:05] ACKNOWLEDGEMENT - SSH access on lead is CRITICAL: Connection refused daniel_zahn gerrit migration [02:16:05] ACKNOWLEDGEMENT - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn gerrit migration [02:16:34] I'm guessing connections from labs to production port 22s are blocked [02:17:07] yupp [02:17:21] "no route to host" is still strange [02:17:26] refused.. yes [02:17:38] yeah gallium is a prod host somewhere, not sure which subnet [02:17:46] it shouldn't have such issues [02:17:54] and it was trying port 29418, so... [02:18:29] public1-b-eqiad [02:18:32] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures [02:18:34] maybe I only accepted the IPv4 address [02:18:38] in known_host [02:18:52] that wouldn't explain no route to host [02:18:58] and Zuul definitely tried to ssh to the IPv6 ones, so it might choke on the unknown key [02:19:14] silver also gets connection refused. is it still saying no route to host hashar? 
[02:19:21] you know, we have IPv6 now [02:19:23] gerrit is down right now [02:19:27] and before something was missing [02:22:02] nevermind, it was just (gerrit old did not have reverse DNS for IPv6) [02:22:29] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 09m 09s) [02:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:42] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures [02:23:51] ACKNOWLEDGEMENT - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures daniel_zahn cant clone from gerrit right now [02:24:29] mwdeploy@tin ?? [02:24:51] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures [02:24:53] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures [02:25:32] ACKNOWLEDGEMENT - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn gerrit is down [02:26:14] ACKNOWLEDGEMENT - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn gerrit is down [02:26:14] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 7 failures daniel_zahn gerrit is down [02:26:14] ACKNOWLEDGEMENT - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures daniel_zahn gerrit is down [02:26:42] 86% done indexing [02:28:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jul 25 02:28:21 UTC 2016 (duration 5m 52s) [02:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:01] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures [02:30:41] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [02:31:44] ACKNOWLEDGEMENT - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn gerrit is down [02:31:44] ACKNOWLEDGEMENT - puppet last run on stat1004 is CRITICAL: CRITICAL: 
Puppet has 1 failures daniel_zahn gerrit is down [02:32:00] kafka hosts are somethign different [02:32:29] actually no, also git pull [02:33:15] ACKNOWLEDGEMENT - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:33:15] ACKNOWLEDGEMENT - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:33:15] ACKNOWLEDGEMENT - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:33:15] ACKNOWLEDGEMENT - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:34:40] ostriches: at 98% it slows down, doesnt it :) [02:35:44] Sssshhhh :p [02:37:32] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures [02:37:52] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [02:38:16] neon is because tendril [02:38:31] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [02:39:53] analytics1027 is hue.. all of that the same thing [02:40:09] ACKNOWLEDGEMENT - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:40:09] ACKNOWLEDGEMENT - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:40:09] ACKNOWLEDGEMENT - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn gerrit is down [02:40:30] ostriches: so progress? :) [02:40:38] 97% [02:40:46] wait [02:40:54] wasn't it at 98% 5 minutes or so ago?? [02:40:58] No [02:41:07] It was at 86% [02:41:12] https://xkcd.com/612/ [02:41:16] mutante was joking about how it gets slow at 98% lol. 
[02:49:01] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [02:49:15] ^ i have stopped it [02:49:54] ACKNOWLEDGEMENT - zuul_gearman_service on gallium is CRITICAL: Connection refused daniel_zahn gerrit migration [02:49:54] ACKNOWLEDGEMENT - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server daniel_zahn gerrit migration [02:51:02] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [03:01:54] Almost there.... [03:01:59] This is dumb tho [03:02:54] shuts up to not jinx it [03:03:08] having a smoke outside brb [03:07:18] ostriches: so what is up ? that is a Microsoft 1 minute left 99% status ? :) [03:07:33] Yep. [03:08:22] And done! [03:08:41] :) [03:09:01] RECOVERY - gerrit process on lead is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [03:10:23] RECOVERY - SSH access on lead is OK: SSH OK - GerritCodeReview_2.12.2 (SSHD-CORE-0.14.0) (protocol 2.0) [03:10:31] Ok perfect. [03:10:33] Much better. [03:10:35] Dashboards look sane. [03:10:55] yay! [03:11:06] i see gerrit again [03:11:11] trying zuul [03:11:21] Krenair: Your "change looked like one from gerrit-new" is fixed too afaict. [03:11:25] error: [Errno 113] No route to host [03:11:28] (unless it isn't on your side) [03:11:39] logs in [03:11:44] that is over ipv6 [03:11:50] from gallium [03:12:15] Ok let's figure this part out now [03:12:35] 2620:0:861:3:208:80:154:82 is unreacheable [03:12:37] from gallium :( [03:13:04] maybe that is a ferm rule ? [03:13:06] gallium doesn't have an ipv6 address does it? [03:13:23] We should've already applied the updated ferm rules for gallium [03:13:25] Errrr [03:13:26] Wait. 
[03:13:30] oh [03:13:34] I wonder if it's from the "old" lead. [03:13:40] And puppet didn't detect that to change. [03:13:43] [gallium:~] $ host 2620:0:861:3:208:80:154:85 [03:13:49] that is gerrit [03:13:51] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:13:56] :85 [03:15:05] btw, it worked for me. [03:15:05] it should not be trying :82 [03:15:19] Hi Chad, you have successfully connected over SSH. [03:15:20] :) [03:16:19] maybe becaues the reverse record was longer TTL [03:17:33] what do you mean by step 9 [03:17:47] after verifying the installation [03:17:50] ostriches: so does some ferm rule needs to be adjusted? [03:17:52] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:18:10] or should we point zuul to gerrit.wikimedia.org instead of lead? [03:18:13] hashar: We added ferm rules, but I'm curious if when we swapped dns puppet didn't notice. [03:18:29] it almost certainly did not [03:18:39] we can delete the rules and let it recreate them [03:18:39] But that's probably not it, or it wouldn't work when I just tried myself. [03:18:53] it's trying the wrong IP though [03:19:02] if it's still trying :82 [03:19:27] Ah, doing lead I get the same result as hashar. Doesn't work. [03:19:44] It works with gerrit.wm.o though [03:20:02] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:20:02] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [03:20:07] using gerrit seems better anyways? [03:20:13] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [03:20:15] That's what I've been saying :) [03:20:46] Oh nope, it works too. [03:20:50] Just a million times slower. 
[03:22:13] well [03:22:23] going to use gerrit.wm.o [03:22:29] ok, confirmed, ping6 to lead .. unreachable [03:22:29] and stop refering to lead [03:22:37] ping6 to gerrit. ok [03:22:47] Ah! [03:22:50] Got it I think [03:23:32] Plus. [03:23:41] ferm rules seem wrong anyway [03:23:45] They just say ssh. [03:23:48] And port 22. [03:23:51] Which is wrong port! [03:24:23] ostriches: since zuul hosts can ssh to gerrit.wikimedia.org IP [03:24:31] we can skip port 29418 on lead.wm.o [03:24:42] and just switch CI to point to the service hostname/IP [03:24:56] Wait what are they using the port 22 ssh daemon for? [03:25:10] they? [03:25:12] zuul? [03:25:12] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:25:15] Yeah. [03:25:22] na they dont use 22 [03:25:29] Well the ferm rules are useless then [03:25:29] if you see connections, that is me [03:25:41] port => '22', [03:26:02] !log scandium: migrating zuul-merger repos from lead to gerrit.wikimedia.org: find /srv/ssd/zuul/git -path '*/.git/config' -print -execdir sed -i -e 's/lead.wikimedia.org/gerrit.wikimedia.org/' config \; [03:26:36] ostriches: where do you look at them? [03:26:44] I'm in contint::firewall [03:26:46] In puppet [03:26:53] (I can't cd to /etc/ferm/, no permission) [03:27:23] Patch incoming. 
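The find/sed one-liner logged at 03:26:02 rewrites the Gerrit hostname inside each repository's .git/config. A rough Python equivalent is sketched below, assuming the same layout (a file named config directly inside each .git directory). Unlike the sed expression, it does a literal string replacement of every occurrence, and the repository path and remote URL in the demo are invented for illustration.

```python
import os
import tempfile
from pathlib import Path


def rewrite_remote_host(root, old, new):
    """Replace `old` with `new` in every <repo>/.git/config under root."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == ".git" and "config" in filenames:
            config = Path(dirpath) / "config"
            text = config.read_text()
            if old in text:
                config.write_text(text.replace(old, new))
                changed.append(config)
    return changed


# Demo on a throwaway directory tree rather than a real zuul-merger checkout.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / "operations" / "puppet" / ".git"
    git_dir.mkdir(parents=True)
    (git_dir / "config").write_text(
        '[remote "origin"]\n'
        "\turl = ssh://lead.wikimedia.org:29418/operations/puppet\n"
    )
    changed = rewrite_remote_host(tmp, "lead.wikimedia.org", "gerrit.wikimedia.org")
    result = (git_dir / "config").read_text()

print(len(changed), "file(s) rewritten")
```

Returning the list of touched files gives a quick way to confirm how many repositories were actually switched over.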
[03:27:52] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:28:06] grrrit-wm: kick [03:28:09] grrrit-wm: reload :) [03:28:48] puppet patch https://gerrit.wikimedia.org/r/#/c/300809/ will switch zuul to use gerrit.wikimedia.org instead of lead [03:28:58] already manually hacked and that works :) [03:30:00] i remember when we wrote that ferm rule [03:30:02] hashar: lgtm [03:30:04] and copied it from the old server [03:30:06] Let's do that [03:30:27] # ssh access for git on old gerrit server [03:30:46] i remember copying that line and changing it to "new server" [03:30:58] mutante: can you land that please: https://gerrit.wikimedia.org/r/#/c/300809/ ? :) [03:30:58] what happened [03:31:03] how? [03:32:34] (CR) Dzahn: [C: 2] zuul: use Gerrit service hostname instead of server [puppet] - https://gerrit.wikimedia.org/r/300809 (owner: Hashar) [03:32:37] (PS1) Chad: Contint/Gerrit: Fix up ferm rules and zuul service address [puppet] - https://gerrit.wikimedia.org/r/300810 [03:32:41] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:33:02] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [03:33:06] will run puppet on gallium/scandium [03:33:10] Cannot update refs/heads/production [03:33:12] no, wait [03:33:15] oh [03:33:17] how am i supposed to merge [03:33:38] that is when you press "submit" ? [03:33:46] yes [03:34:05] Did zuul not merge? [03:34:23] ? that change fixes the zuul config [03:34:46] Yeah it fixes zuul, or at least makes it a better choice. 
[03:35:13] it is stuck in ready to submit nah [03:35:32] Merge Conflict [03:35:32] com.google.gerrit.server.git.IntegrationException: Cannot update refs/heads/production [03:35:42] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:36:00] Well that's no bueno. [03:36:04] Lemme look [03:36:09] at com.google.gerrit.server.git.MergeOp.updateBranch(MergeOp.java:801) [03:36:09] at com.google.gerrit.server.git.MergeOp.integrateIntoHistory(MergeOp.java:431) [03:36:09] at com.google.gerrit.server.git.MergeOp.merge(MergeOp.java:380) [03:36:09] at com.google.gerrit.server.change.Submit.apply(Submit.java:201) [03:36:19] permission issue myabe? [03:36:36] Caused by: java.io.FileNotFoundException: /srv/gerrit/git/operations/puppet.git/logs/refs/heads/production (Permission denied) [03:36:40] ostriches: ^^ [03:37:09] Ok.... [03:37:10] Hmm [03:37:33] Ahhh [03:37:37] ohhh [03:37:40] Couple of repos aren't owned by gerrit2 [03:37:43] Fixing [03:37:48] be back in 2 min [03:38:02] ostriches: that dir is owned by root ! [03:38:12] Running it -R for all repos. [03:39:14] Ok, submit should work again [03:41:57] does gerrit 2.12 fix the issue with autoupdate of mediawiki/extensions/VisualEditor due to another repo with the same basename ( VisualEditor/VisualEditor ) [03:42:22] We haven't figured out yet. 
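The submit failure above came down to file ownership: some repository directories were owned by root rather than gerrit2, and the fix was a recursive chown over all repos. A hedged Python sketch of how one might list the offending paths before changing anything is shown here. `find_not_owned_by` is a hypothetical helper, and the numeric uid of the service account would have to be looked up separately (for example with `pwd.getpwnam`).

```python
import os


def find_not_owned_by(root, expected_uid):
    """Return sorted paths under root (root included) whose owner uid differs.

    Uses lstat so symlinks are checked themselves rather than followed.
    """
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

An empty result for the Gerrit git directory would suggest ownership is not the cause of a "Permission denied" in the submit path. File modes could still be, and this sketch does not check them.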
[03:42:29] That's the least of my worries right now though [03:42:41] yeah was just wondering :) [03:45:07] https://gerrit.wikimedia.org/r/#/c/300809/ [03:45:09] merged [03:45:12] on puppetmaster [03:45:16] worked now [03:45:16] neat [03:45:53] running puppet on gallium/scandium [03:46:12] for grrrit bot we would have to https://wikitech.wikimedia.org/wiki/Grrrit-wm#Deploying_or_restarting [03:46:17] no clue how that works really [03:46:48] ostriches: com.googlesource.gerrit.plugins.its.base.workflow.RuleBase : Neither global rule file /var/lib/gerrit2/review_site/etc/its/actions.config nor Its specific rule file/var/lib/gerrit2/review_site/etc/its/actions-its-phabricator.config exist. Please configure rules. [03:46:49] yes, i already did that a couple times [03:46:59] ostriches: seems to be a deprecation [03:47:03] Yeah [03:47:07] I've gotta fix that stuff up. [03:47:34] and another one about not being to send email due to missing variables required by the template [03:47:46] Yeah, I think that's related actually [03:47:48] It can't find the rules [03:49:20] when i click the rebase button [03:49:43] there is a popup now that asks for .. [03:49:54] well "or leave empty" [03:50:01] (03PS3) 10Dzahn: gerrit: up heap size limit from 20GB to 28GB [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) [03:51:22] (03PS1) 10Chad: Gerrit: Rules file was renamed [puppet] - 10https://gerrit.wikimedia.org/r/300811 [03:51:37] (03CR) 10Dzahn: [C: 031 V: 031] "maybe now?" [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) (owner: 10Dzahn) [03:52:54] mutante: Let's wait until we're done cleaning up. [03:56:38] ostriches: file rename go now? 
[03:56:57] Yeah [03:57:13] i cant really confirm that from just puppet itself [03:57:13] ok [03:57:26] (03CR) 10Dzahn: [C: 032] Gerrit: Rules file was renamed [puppet] - 10https://gerrit.wikimedia.org/r/300811 (owner: 10Chad) [03:58:13] are we waiting for a V 2 or not [03:58:42] It should be working again right? [03:59:00] it verified but not merged [03:59:06] Merge should work again? [03:59:07] submitting again [03:59:08] I fixed that [03:59:14] merged [03:59:41] on palladium [03:59:41] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [03:59:47] bah [04:00:03] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [04:00:07] mutante: Can you land the "turn off maintenance" one too? [04:00:12] So I can leave puppet running on lead. [04:00:26] https://gerrit.wikimedia.org/r/#/c/300808/ [04:00:39] https://gerrit.wikimedia.org/r/#/q/owner:chad+status:open [04:00:46] i see "merge conflict" there [04:01:26] Yeah it's kinda nice ;-) [04:01:29] (03CR) 10Dzahn: [C: 032] Gerrit: Maintenance is over yay! [puppet] - 10https://gerrit.wikimedia.org/r/300808 (owner: 10Chad) [04:01:36] (03PS2) 10Dzahn: Gerrit: Maintenance is over yay! [puppet] - 10https://gerrit.wikimedia.org/r/300808 (owner: 10Chad) [04:01:36] It's fast-forward [04:01:40] So that's why it says that [04:01:51] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730 [04:01:59] (03PS2) 10KartikMistry: WIP: Configurable mode_patch for apertium [puppet] - 10https://gerrit.wikimedia.org/r/297350 (https://phabricator.wikimedia.org/T139330) [04:02:11] * AaronSchulz looks at the new GUI [04:02:14] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [04:02:23] (03CR) 10Dzahn: [V: 032] Gerrit: Maintenance is over yay! 
[puppet] - 10https://gerrit.wikimedia.org/r/300808 (owner: 10Chad) [04:03:19] ostriches: ok, done. firewall fix? [04:03:34] If we think we need it. It seems to be Just Working [04:03:39] Maybe wait on that until we can ask others? [04:04:09] yeah [04:04:16] ok [04:04:18] zuul is now on the service IP / gerrit.wikimedia.org [04:04:21] as should be most [04:04:48] then there is the issue of mail templates [04:06:11] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/300808 (owner: 10Chad) [04:06:22] hashar: Yeah I'm working on the mail template. [04:06:25] I hate those templates. [04:06:46] yea, so it probably never needed that port 22, ack [04:06:49] I must say I find them scary :D [04:06:49] specially Apache Velocity [04:07:55] how about google bot [04:08:53] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:16] we have one more issue in icinga [04:09:21] ssh on lead refused [04:09:49] well that's that same thing [04:10:04] iirc the only reason we had that template was to work around funky truncation behavior in Gerrit subject lines. [04:10:51] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.939 second response time [04:12:39] (03CR) 10Dzahn: [C: 031] remove ytterbium from puppet, update gerrit comment [puppet] - 10https://gerrit.wikimedia.org/r/300806 (owner: 10Dzahn) [04:13:14] modules/admin/files/enforce-users-groups.sh: "gerrit2" \ # ytterbium.wikimedia.org [04:13:32] that was for the special gerrit2 user, just comment though [04:14:23] ostriches: so maybe drop that ChangesSubject.vm template ? 
[04:14:37] we can revisit it / fix the subject later on [04:15:51] also a search for a label vote score of zero yields nothing eg: is:open label:verified=0 [04:15:57] not so worrying though [04:17:17] Nah, I wanna keep it [04:17:20] I'm trying to figure it out [04:17:40] I'm afraid of breaking people's e-mail rules :) [04:18:41] better to break them for a bit but at least get emails :D [04:18:49] CI looks all fine and shinny [04:18:50] when do we change the topic? [04:19:44] is the template issue still part of downtime? [04:19:56] I would say so :D [04:20:09] ostriches: do you still need me around? [04:21:07] (03PS1) 10Dzahn: remove ytterbium from netboot,DHCP [puppet] - 10https://gerrit.wikimedia.org/r/300812 [04:22:06] what does it break exactly right now [04:23:20] mutante: No [04:23:27] It breaks sending e-mails. [04:23:35] I'm just gonna remove the file. [04:24:03] ah ok! [04:24:19] i was just wondering if we should stop people from using it [04:24:24] or encourage them to now [04:24:48] Patch incoming [04:24:59] Er, not remove, but fixed. [04:25:04] ok [04:25:10] (03PS1) 10Chad: Gerrit: Update ChangeSubject.vm mail template [puppet] - 10https://gerrit.wikimedia.org/r/300813 [04:26:50] ah , that [04:28:51] https://phabricator.wikimedia.org/rGGER43b10f86723de9c572dbb78fe26b97b56c45da18 [04:29:00] or so [04:29:05] (03CR) 10Dzahn: [C: 032] Gerrit: Update ChangeSubject.vm mail template [puppet] - 10https://gerrit.wikimedia.org/r/300813 (owner: 10Chad) [04:29:48] (03CR) 10Dzahn: "something like https://phabricator.wikimedia.org/rGGER43b10f86723de9c572dbb78fe26b97b56c45da18" [puppet] - 10https://gerrit.wikimedia.org/r/300813 (owner: 10Chad) [04:30:48] ostriches: please puppet [04:31:06] i kind of remember discussion about gerrit subject in gmail [04:32:12] Yeah [04:35:20] the mail subject is different :D [04:35:51] Different yeah, but I kept the [Gerrit] tag at the start. 
[04:35:52] it has an extra: "Change in repo[branch]:" [04:36:19] when previously the branch was not part of the subject and repo name at the end of mail :D [04:37:12] It should've been. The old template had the branch in there, along with the repo. [04:37:17] It just usually got snipped off the end. [04:38:50] I would drop the "Change in" and move the repo name back at the ned [04:38:51] end [04:39:03] I don't care about that part right now. [04:39:08] I just wanted e-mail working [04:39:17] And I hate velocity templates. [04:39:19] cause having a mailbox full of [Gerrit] Change in mediawiki/core[master]: .... [04:39:22] is not so nice :D [04:39:59] I think the repo name should be first tbh. [04:40:03] The "Change in" is redundant though [04:40:35] I typically have a filter per repo [04:40:51] so the repo name is redundant to me :] [04:41:26] I have no filters, all my gerrit mail goes to the same folder ;-) [04:41:47] heh, same here [04:42:15] checks queue in web ui :p [04:43:03] heading to bed *wave* [04:43:05] and kudos :) [04:47:12] (03PS1) 10Chad: Gerrit: Minor tweak to change subject... "Change In" is redundant [puppet] - 10https://gerrit.wikimedia.org/r/300814 [04:49:34] (03CR) 10Dzahn: [C: 032] Gerrit: Minor tweak to change subject... "Change In" is redundant [puppet] - 10https://gerrit.wikimedia.org/r/300814 (owner: 10Chad) [04:59:10] ok, that was one last service and bot restart for now [05:03:16] we are calling it a day. laters [05:05:05] (03PS3) 10Chad: Minor tweaks to 2.12.2 package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) [05:06:42] (03PS1) 10Chad: TESTING STUFF [puppet] - 10https://gerrit.wikimedia.org/r/300815 (https://phabricator.wikimedia.org/T70271) [05:14:49] Is gerritbot down? [05:15:11] ostriches: ^ ? [05:15:31] The irc bot or the Phab talking bit? 
[05:15:39] the phab one [05:16:00] https://gerrit.wikimedia.org/r/300816 didn't trigger a phab comment [05:16:09] Yeah I'm trying to figure that bit out [05:16:28] ah ok [05:16:53] I'm assuming it's because the conduit APIs changed or something but there's precious little logging here. [05:17:51] (03CR) 10Faidon Liambotis: ""Doesn't need to be done by gerrit2 anymore so let's do it as root" sounds like the opposite than it should be: the principle of the least" [puppet] - 10https://gerrit.wikimedia.org/r/300711 (owner: 10Chad) [05:21:00] (03CR) 10Chad: "That makes sense. I suppose it was the wrong approach but then we'll have to ensure that file is owned by gerrit2." [puppet] - 10https://gerrit.wikimedia.org/r/300711 (owner: 10Chad) [05:22:02] (03CR) 10Faidon Liambotis: [C: 032] Ciphersuite upgrades for one-off sites [puppet] - 10https://gerrit.wikimedia.org/r/300071 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [05:42:41] (03PS1) 10Dzahn: Revert "Gerrit: Run list_reviewer_counts cron as root" [puppet] - 10https://gerrit.wikimedia.org/r/300818 [05:47:13] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483556 (10faidon) I configured pfw-eqiad port ge-2/0/11 to be in the fundraising VLAN. You might want to open a new #ops-eqiad tas... [05:48:00] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2491150 (10HJiang-WMF) Thanks a lot Dan! Will do the config and try the connection as you suggested. [05:50:22] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 1, dormant: 0, excluded: 1, unused: 0BRge-2/0/11: down - frqueue1001BR [05:59:37] (03CR) 10Dzahn: "yea, we should use gerrit2 again. 
i'll follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/300818 (owner: 10Dzahn) [06:01:51] (03CR) 10Dzahn: "Faidon, thoughts about googlebot?" [puppet] - 10https://gerrit.wikimedia.org/r/300692 (owner: 10Chad) [06:02:25] (03CR) 10Faidon Liambotis: [C: 032] Gerrit: Remove googlebot from banned IPs. They ain't so bad [puppet] - 10https://gerrit.wikimedia.org/r/300692 (owner: 10Chad) [06:02:42] (03PS2) 10Faidon Liambotis: Gerrit: Remove googlebot from banned IPs. They ain't so bad [puppet] - 10https://gerrit.wikimedia.org/r/300692 (owner: 10Chad) [06:06:29] (03CR) 10Dzahn: "we are still wondering about heap size. it's set to 20G, we have 32G RAM on lead. this would set it to 28. we wanted to wait until after u" [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) (owner: 10Dzahn) [06:22:59] (03CR) 10Dzahn: "related to this is the cron spam "*** SECURITY information for lead.wikimedia.org ***"" [puppet] - 10https://gerrit.wikimedia.org/r/300711 (owner: 10Chad) [06:24:43] ostriches https://gerrit.googlesource.com/plugins/its-phabricator/+/master/src/main/resources/Documentation/config-connectivity.md [06:30:42] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:43] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:02] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:12] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:31] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:32] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:41] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: replace gerrit 
server (ytterbium) with jessie server (lead) - https://phabricator.wikimedia.org/T125018#2491177 (10Dzahn) a:03Dzahn let me close this when all ytterbium remnants are actually gone (decom etc) [06:32:41] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:03] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:23] 06Operations, 10Gerrit, 10Mail, 13Patch-For-Review, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2491183 (10demon) [06:34:01] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I am ok with the patch, but please fix the tests accordingly" [software/conftool] - 10https://gerrit.wikimedia.org/r/294371 (owner: 10BBlack) [06:40:02] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:46:07] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2491222 (10MoritzMuehlenhoff) I'll drop those people with an expired PGP key later on, so that we can add her to pwstore. 
[06:51:11] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [06:55:02] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:55:33] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:56:02] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:56:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:53] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:21] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:05:32] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: puppet fail [07:06:12] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 
failures [07:09:43] !flames BASURAS...... MATARE A A SIMOV...... HAHAHAHAHA [07:11:57] <_joe_> ok [07:12:19] <_joe_> heh [07:12:26] <_joe_> this was enough...
[07:13:30] <_joe_> third time in 3 weeks [07:13:35] <_joe_> someone opened the cages [07:31:42] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:24] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2491254 (10Joe) [07:39:57] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491255 (10Joe) [07:43:04] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491269 (10Joe) [07:50:16] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2491273 (10MoritzMuehlenhoff) @madhuvishy : Please upload your PGP key to the public keyserver network by running gpg --send-key FINGERPRINT_OF_YOUR_KE... 
[07:53:45] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491274 (10Joe) [07:53:47] (03CR) 10Gilles: lvs: add thumbor to lvs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [07:56:28] (03CR) 10Gilles: [C: 031] Labs: remove wgThumbnailMinimumBucketDistance - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300474 (owner: 10MaxSem) [07:56:40] (03CR) 10Gilles: [C: 031] Labs: remove wgThumbnailBuckets - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300475 (owner: 10MaxSem) [07:57:11] _joe_: I liked the response "ok", very professional [07:57:15] :D [07:59:42] low-key, understated, efficient [08:08:11] ostriches Glaisher hi, we now need to generate a conduit token for the gerritbot. [08:15:33] hello [08:15:42] Hi [08:15:55] o/ [08:16:27] just woke up. But in theory we are still on Gerrit 2.12.x :] [08:16:38] Yep [08:16:46] but phab bot broken, ie gerritbot [08:17:06] bah [08:17:11] I belive [08:17:14] we need to set [08:17:17] a conduit token now [08:17:34] See task https://phabricator.wikimedia.org/T141241 for where i wrote what the problem might be [08:17:55] Also the task you filled earley today is fixed in gerrit 2.12.3. [08:20:00] paladox: about Gerrit email subject ? [08:20:12] No, the one with labels [08:20:25] AHHHH [08:20:42] Yep [08:20:47] the good thing is you paladox is that I no more have to hunt in upstream changelogs :] [08:21:07] Oh :) [08:21:31] Ive been running gerrit 2.12.3 here http://gerrit-test.wmflabs.org [08:21:33] :) [08:22:38] hashar you can give it a test at ^^ to see if it fixes your problem. 
[08:22:40] :) [08:23:23] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491319 (10Joe) The main recurring problem seems to be a change in behaviour in erb between ruby 1.8 and ruby 2.1, namely: Ruby 1.8 ```... [08:23:41] hashar it seems to work http://gerrit-test.wmflabs.org/gerrit/#/q/is:open+label:verified%253D0 [08:23:54] Compared to https://gerrit.wikimedia.org/r/#/q/is:open+label:verified%253D0 [08:25:33] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491335 (10Joe) [08:35:30] paladox: the commit https://gerrit-review.googlesource.com/#/c/76685/ is straightforward :) [08:35:43] Yep :) [08:35:44] that is the usual boolean logic error ;D [08:35:57] Oh [08:36:10] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491347 (10Joe) [08:38:59] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2491348 (10elukey) [[https://github.com/memcached/memcached/wiki/ReleaseNotes1429 | 1.4.29]] is out and includes a big change, namely the maximum item size is n... 
[08:50:23] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures [08:50:47] (03PS1) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [08:56:32] (03CR) 10Filippo Giunchedi: lvs: add thumbor to lvs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [08:56:46] (03PS1) 10Hashar: package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) [08:58:36] (03CR) 10Paladox: [C: 031] package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [09:00:44] (03CR) 10Hashar: [C: 04-1] "Gotta polish it up:" [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [09:05:15] (03PS2) 10Hashar: package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) [09:05:26] (03CR) 10QChris: Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [09:06:22] qchris: *wave* :] [09:06:30] Heya hashar! [09:06:57] Fresh gerrit \o/ [09:07:20] yeah finally after all those years! [09:07:24] Chad aced it :] [09:07:37] Yay! [09:07:53] we have reason to think we are not going to rollback to 2.8 ;D [09:08:02] (03PS1) 10Filippo Giunchedi: rsyslog: temporarily lower centralserver retention [puppet] - 10https://gerrit.wikimedia.org/r/300833 (https://phabricator.wikimedia.org/T139612) [09:08:07] Hahaha. Glad to hear that. [09:08:14] It doesn't seem anything has broken except for the gerritbot for phab [09:08:29] but I believe that is because it needs config updating.
[09:08:30] That part is probably easily fixable ;-) [09:08:35] surely [09:08:36] Yep [09:08:41] It needs a conduit token now [09:08:50] according to the docs in the source code for it [09:08:56] and probably a bunch of people's workflows / habits will have to be adjusted, which isn't surprising after 4 releases [09:09:07] Yep [09:09:10] We're using a token already, aren't we? [09:09:24] I don't think so [09:09:39] It is set like [09:09:40] [its-phabricator] [09:09:40] url = https://phabricator.wikimedia.org/ [09:09:40] username = gerritbot [09:09:54] oh [09:10:07] The token would be in the secret.config [09:10:10] passwords are in a different file [09:11:00] well [09:11:05] I meant the same as qchris :] [09:12:07] It needs to be set like [09:12:13] Argh. I meant /var/lib/gerrit2/review_site/etc/secure.config [09:12:14] [its-phabricator] [09:12:15] url = https://phabricator.wikimedia.org/ [09:12:15] username = gerritbot [09:12:15] certificate=CERTIFICATE_FOR_ABOVE_USERNAME [09:12:24] Per docs https://gerrit.googlesource.com/plugins/its-phabricator/+/master/src/main/resources/Documentation/config-connectivity.md [09:12:26] hashar ^^ [09:12:32] Check the above file, I am sure it's there. [09:12:39] This has always been a requirement. [09:13:04] Without the token, conduit never allowed comments etc. [09:13:24] Oh yes [09:13:28] sorry it is already set [09:13:29] [its-phabricator] [09:13:29] certificate = <%= @phab_cert %> [09:13:33] in the secret file [09:13:36] Cool. [09:14:33] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:14:39] But aren't we on a new host [09:14:48] Which means the certificate might not be saved in ~/.arcrc [09:14:51] qchris ^^ [09:15:03] some Gerrit API must have changed which breaks the bot [09:15:09] its-phabricator does not use ~/.arcrc.
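The exchange above turns on which file holds which setting: `url` and `username` in one `[its-phabricator]` section, and `certificate` in the same section of secure.config. A rough Python sketch of checking that all three keys quoted from the plugin docs are present across the files is given here. The function names are made up for illustration, and the parser only understands the simple `[section]` / `key = value` form shown in the chat, not full git-config syntax.

```python
def parse_git_style_config(text):
    """Parse minimal git-config-style text into {section: {key: value}}.

    Only plain [section] headers and key = value lines are handled;
    subsections and continuation lines are deliberately not supported.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip(), {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


def missing_its_keys(*config_texts, required=("url", "username", "certificate")):
    """Merge the its-phabricator section across files; return missing keys."""
    merged = {}
    for text in config_texts:
        merged.update(parse_git_style_config(text).get("its-phabricator", {}))
    return [key for key in required if key not in merged]
```

Passing the contents of both files and getting an empty list back would match the conclusion reached in the chat: the certificate is already configured, so the breakage likely lies elsewhere.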
[09:15:09] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.7 - https://phabricator.wikimedia.org/T141242#2491403 (10Joe) In the case of fluorine, the class just needs to be set in autoload layout (it's in `private:modules/contacts/manifests/... [09:15:14] Chad will dig in it first thing when he is back [09:15:28] Yup. I'll not try to step on his toes. [09:15:50] If he is running into issues, I can free up some cycles on Wednesday to help fixing. [09:16:12] I know that Gerrit folks recently switched to lazy getters. [09:16:22] Not sure if 2.12.2 is already using those. [09:16:29] It may be because the plugin has not been updated in 7 months [09:17:01] paladox: I am sure Luca will gladly merge patches ;-) [09:17:12] (03CR) 10Muehlenhoff: "Can we use the mysql-connector-java provided in Debian? The package name is libmysql-java (also, the version in Debian fixes a security is" (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [09:17:27] Oh [09:18:24] !log swift eqiad-prod: ms-be102[3456] weight 1500 [09:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:20:56] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2491420 (10fgiunchedi) indeed, looks like serpens started leaking memory even with an updated slapd, and as soon as it took over from seaborgium. [09:32:30] (03CR) 10Gilles: puppetization for thumbor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [09:34:05] 06Operations, 10Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491461 (10elukey) p:05Triage>03Normal [09:40:34] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[09:41:33] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [09:42:32] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:43:46] (03PS4) 10MarcoAurelio: Closing wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298772 (https://phabricator.wikimedia.org/T139032) [09:44:35] 06Operations, 10Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2491474 (10ema) Wikitech page added: https://wikitech.wikimedia.org/wiki/XKey [09:47:26] (03PS1) 10Giuseppe Lavagetto: snapshots: do not double-enclose wiki name in brackets [puppet] - 10https://gerrit.wikimedia.org/r/300837 (https://phabricator.wikimedia.org/T141242) [09:47:28] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: sort redis instances [puppet] - 10https://gerrit.wikimedia.org/r/300838 (https://phabricator.wikimedia.org/T141242) [09:47:30] (03PS1) 10Giuseppe Lavagetto: role::statistics: fix lookup of statistics classes [puppet] - 10https://gerrit.wikimedia.org/r/300839 (https://phabricator.wikimedia.org/T141242) [09:47:41] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:47:43] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[09:47:45] (03CR) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [09:51:51] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:53:52] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [09:55:51] 06Operations, 10Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491043 (10elukey) Looking into the puppet logs the 10.64.20.18 reference to californium was added at: ``` --- /etc/icinga/puppet_hosts.cfg 2016-07-22 15:32:17.000000000 +0000 ``` It was... [09:55:52] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:57:08] 06Operations, 10Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491043 (10Joe) @elukey that's most probably my fault; can you list those hosts? 
[09:58:21] (03PS2) 10ArielGlenn: fix up base wiki handling for onallwikis [dumps] - 10https://gerrit.wikimedia.org/r/300437 [09:58:30] (03CR) 10Glaisher: Disabling local uploads on ms.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [10:01:05] (03CR) 10Glaisher: [C: 04-1] Disabling local uploads on ms.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [10:02:55] (03CR) 10ArielGlenn: [C: 032] clean up verbose mode print of commands to run [dumps] - 10https://gerrit.wikimedia.org/r/300294 (owner: 10ArielGlenn) [10:04:51] !log installing Django security updates [10:04:52] 06Operations, 10Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491516 (10elukey) @Joe sure! analytics1021 analytics1022 analytics1023 analytics1024 analytics1025 antimony argon berkelium caesium calcium californium capella [10:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:39] (03CR) 10ArielGlenn: [C: 032] fix up base wiki handling for onallwikis [dumps] - 10https://gerrit.wikimedia.org/r/300437 (owner: 10ArielGlenn) [10:06:01] (03CR) 10jenkins-bot: [V: 04-1] fix up base wiki handling for onallwikis [dumps] - 10https://gerrit.wikimedia.org/r/300437 (owner: 10ArielGlenn) [10:06:07] (03PS3) 10Elukey: admin: add jsamra to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/300678 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [10:06:22] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[10:08:12] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:10:30] <_joe_> !log remove spurious puppet facts [10:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:37] <_joe_> !log remove spurious puppet facts [10:10:41] <_joe_> geez [10:10:43] <_joe_> sigh [10:16:01] 06Operations, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2491529 (10fgiunchedi) [10:22:24] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host cp1057 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:24] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:25] PROBLEM - Host cp1056 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:36] <_joe_> wtf? 
[10:22:41] <_joe_> this is my fault for sure [10:22:55] <_joe_> other hosts that we needed to clean I guess [10:23:03] (03CR) 10ArielGlenn: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/300437 (owner: 10ArielGlenn) [10:23:13] <_joe_> yes [10:23:55] PROBLEM - configured eth on analytics1017 is CRITICAL: Connection refused by host [10:24:16] PROBLEM - dhclient process on analytics1017 is CRITICAL: Connection refused by host [10:24:44] PROBLEM - puppet last run on analytics1017 is CRITICAL: Connection refused by host [10:24:53] <_joe_> sorry for the spam [10:25:04] PROBLEM - salt-minion processes on analytics1017 is CRITICAL: Connection refused by host [10:25:44] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [10:26:54] PROBLEM - DPKG on analytics1017 is CRITICAL: Connection refused by host [10:27:06] PROBLEM - Disk space on analytics1017 is CRITICAL: Connection refused by host [10:29:57] <_joe_> should all go away in minutes [10:30:03] Icinga configuration is correct \o/ [10:44:48] (03PS1) 10ArielGlenn: dumps misc crons: convert media list generation to use onallwikis [puppet] - 10https://gerrit.wikimedia.org/r/300847 (https://phabricator.wikimedia.org/T133694) [10:47:59] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [10:49:20] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [10:49:50] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [10:51:39] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[10:52:05] correct but still a christmas tree :( [10:53:29] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:53:40] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:00:40] yeah I'm silencing those [11:00:47] what's with ganeti1004? [11:01:52] ah I think expired downtime, got its disks swapped with ssd last week IIRC [11:04:10] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 2, unused: 0 [11:06:44] in other news, serpens still leaks memory even with sladp 2.4.41 :( [11:07:02] moritzm: ^ [11:12:10] (03CR) 10Filippo Giunchedi: puppetization for thumbor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [11:12:18] (03PS2) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [11:13:21] paravoid: sorry didn't see the message, I saw only some alerts in unhandled services but most of them had a downtime msg attached [11:14:07] going to double check again [11:16:35] the only weird ones that I can see are ganeti1004 (but you guys mentioned that it is unrelated) [11:18:14] 06Operations, 10Icinga: icinga config issue with duplicate californium - https://phabricator.wikimedia.org/T141232#2491607 (10elukey) 05Open>03Resolved a:03elukey Joe cleaned up all the old puppet facts (puppet node clean $fqdn) and the issue is now solved. He requested a catalog compilation last week an... 
[11:21:59] (03PS1) 10Faidon Liambotis: Allocate new cr2-eqiad <-> cr2-esams subnets [dns] - 10https://gerrit.wikimedia.org/r/300851 [11:22:12] gerrit is much snappier [11:22:14] now if only jenkins was.. [11:23:43] godog: saw that while catching up with mail backlog, there are no further syncrepl related fixed up to 2.4.44, we'll have to take this upstream [11:24:05] will follow up on the bug [11:25:38] (03CR) 10Faidon Liambotis: [C: 032] Allocate new cr2-eqiad <-> cr2-esams subnets [dns] - 10https://gerrit.wikimedia.org/r/300851 (owner: 10Faidon Liambotis) [11:27:36] (03PS2) 10ArielGlenn: cleanup media list generation cron job, remove from snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/300847 (https://phabricator.wikimedia.org/T133694) [11:28:24] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2491626 (10faidon) In the meantime, I allocated ports and IPv4/IPv6 subnets for the link and setup DNS with 0a1872b. BFD & IGP is still pending and will follow only after we have confirmed that the link w... [11:29:41] moritzm: thanks! also it seems that even if it fails clients still experience outages as chase was mentioning on the task [11:33:34] all clients should be using both slapds (and that worked in the past when one of the VMs was down with ganeti problems), but it might be a case of the server process still being able to accept the connecion, but not actually handle it... [11:40:53] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2491636 (10elukey) Update after a long time (my bad): we have been experimenting with Cassandra bulk loading without really succeeding, even using cassandra 2.2.6. W... 
[11:43:16] (03CR) 10ArielGlenn: [C: 032] cleanup media list generation cron job, remove from snapshot1003 [puppet] - 10https://gerrit.wikimedia.org/r/300847 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [11:44:44] 06Operations, 06Discovery, 06Maps, 10Monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2491640 (10elukey) p:05Triage>03Normal [11:46:19] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2491644 (10elukey) p:05Triage>03High [11:46:47] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch: Elasticsearch SSL on relforge hosts monitoring alerts - https://phabricator.wikimedia.org/T141234#2491647 (10elukey) p:05Triage>03Normal [11:52:36] (03Abandoned) 10ArielGlenn: [WIP] dumps: convert generation of media titles per project to onallwikis [puppet] - 10https://gerrit.wikimedia.org/r/276907 (owner: 10ArielGlenn) [11:58:23] Hallo [11:58:45] https://phabricator.wikimedia.org/T141255 is probably trivial, but I don't know where to find that sql shellscript. [11:59:02] It says "This file is managed by Puppet (modules/scap/files/sql)" at the top, but in which repo is it actually? 
[11:59:33] operations-puppet [12:00:10] aharoni: https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/sql [12:00:19] But you need it in gerrit obvs :) [12:03:34] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [12:05:14] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:06:10] (03PS1) 10ArielGlenn: enable generation of media lists per project on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300857 (https://phabricator.wikimedia.org/T133694) [12:10:14] (03CR) 10Hashar: [C: 031] "My issue appears when using jenkins-debian-glue which set BUILDRESULT and later runs lintian/piuparts against all *.deb found." [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [12:13:39] qchris that's why, the its-phabricator is false see https://gerrit.wikimedia.org/r/#/admin/projects/integration/config please, section its-phabricator Plugin should be true not false. [12:13:45] hashar ^^ ostriches [12:15:12] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2491701 (10akosiaris) 05Open>03Resolved a:03akosiaris So, I 've opted for using the already present puppet code to do the up... [12:19:16] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2491705 (10faidon) I've deliberated this a little bit and honestly my (slight) preference would be to not (ab)use l... 
[12:24:36] (03CR) 10ArielGlenn: [C: 032] enable generation of media lists per project on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300857 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [12:28:39] 06Operations, 05codfw-rollout: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2491743 (10akosiaris) Removing `codfw-rollout-Jan-Mar-2016` since this is going to take place later on [12:29:23] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2491748 (10ArielGlenn) [12:30:56] 06Operations, 06Project-Admins, 05codfw-rollout: Archive #codfw-rollout-Jan-Mar-2016 - https://phabricator.wikimedia.org/T139711#2491753 (10Danny_B) 05Open>03Resolved Remaining task untagged. Project archived. [12:32:45] (03PS1) 10Elukey: Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) [12:33:46] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2491757 (10elukey) Created https://gerrit.wikimedia.org/r/#/c/300860 [12:33:48] (03CR) 10jenkins-bot: [V: 04-1] Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [12:34:44] (03CR) 10Elukey: "The sudoers syntax should be good but I am not sure if the perms that I added are ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [12:36:32] (03CR) 10Filippo Giunchedi: [C: 031] package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [12:38:25] (03PS2) 10Elukey: Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) [12:45:04] (03PS1) 10ArielGlenn: move cron job for pagetitle list generation to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300861 (https://phabricator.wikimedia.org/T133694) [12:49:24] 06Operations, 10Traffic: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2491791 (10elukey) [12:50:49] (03CR) 10ArielGlenn: [C: 032] move cron job for pagetitle list generation to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300861 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [12:51:01] (03PS1) 10Amire80: Split sql to sql and sqlhost [puppet] - 10https://gerrit.wikimedia.org/r/300862 (https://phabricator.wikimedia.org/T141255) [12:51:19] Reedy: https://gerrit.wikimedia.org/r/300862 [12:51:34] this is rather naïve, because I know relatively little bash, and even less puppet [12:51:54] aharoni: you'll need to tell puppet where to put it on disk too [12:52:10] I don't know, for example, whether I need to write any puppet code to say where the new sqlhost file is supposed to go [12:52:21] aha, thought so [12:52:27] no idea where to do it, though [12:52:38] https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/manifests/scripts.pp#L82-L87 [12:52:49] copy/paste/amend [12:53:10] aha, looks right, thanks [12:53:41] !log installing squid security updates [12:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:06] 06Operations, 10Wikimedia-Etherpad: Unable 
to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2491810 (10akosiaris) https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback/timeslider should be good enough to get the data out of the p... [12:55:16] (03CR) 10Gilles: [C: 031] puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [12:55:58] (03PS2) 10Amire80: Split sql to sql and sqlhost [puppet] - 10https://gerrit.wikimedia.org/r/300862 (https://phabricator.wikimedia.org/T141255) [12:56:31] Reedy: done. who should review it? [12:56:32] (03PS1) 10Filippo Giunchedi: nagios_common: add check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/300863 [12:56:42] 06Operations, 10Traffic: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2491811 (10elukey) p:05Triage>03Low [13:04:35] (03PS4) 10Elukey: admin: add jsamra to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/300678 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [13:06:53] (03CR) 10Elukey: [C: 032] admin: add jsamra to analytics-privatedata-users and researchers [puppet] - 10https://gerrit.wikimedia.org/r/300678 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [13:07:26] (03PS1) 10ArielGlenn: move dump of cirrus search data to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300865 [13:08:52] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2491829 (10akosiaris) 1st week of August is fine for me as well. 
[13:10:14] 06Operations, 06Discovery, 06Maps, 10Monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2491834 (10BBlack) Yeah, this is a pattern we see often with ganglia :/ Sometimes it gets fixed by restarting the ganglia-monitor service, sometimes it requires tricky work on the br... [13:10:19] akosiaris: if you are around, I could use the package_builder pbuilderrc to let us override BUILDRESULT so the resulting .deb ends up at a different place than /var/cache/pbuilder :) https://gerrit.wikimedia.org/r/300830 [13:10:35] akosiaris: got it applied on the Jenkins slaves and that does the job [13:10:37] 06Operations, 06Performance-Team, 10Thumbor: Package Thumbor for Debian - https://phabricator.wikimedia.org/T134485#2491835 (10Gilles) Pushed to our internal repo by @fgiunchedi [13:10:57] 06Operations, 06Performance-Team, 10Thumbor: Package Thumbor for Debian - https://phabricator.wikimedia.org/T134485#2491836 (10Gilles) 05Open>03Resolved [13:15:52] (03PS2) 10ArielGlenn: move dump of cirrus search data to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300865 [13:17:21] (03CR) 10ArielGlenn: [C: 032] move dump of cirrus search data to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300865 (owner: 10ArielGlenn) [13:24:20] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2491863 (10MoritzMuehlenhoff) > - stdout is apparently ignored & does not make it into the systemd journal. This is unrelated to firejail and needs... [13:30:17] (03CR) 10Ottomata: "I find journalctl a little cumbersome to use. Logs are also at /var/log/eventlogging/*. Can we allow eventbus-admins to read these too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [13:33:19] (03CR) 10Elukey: "This is only a baseline and I am not super expert in the sudoers syntax, I am planning to ask to Daniel some guidance :)" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [13:33:33] (03PS3) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [13:34:39] (03CR) 10jenkins-bot: [V: 04-1] puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [13:44:09] (03CR) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:46:05] (03PS2) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [13:50:53] (03CR) 10MarcoAurelio: "> (1 comment)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:55:02] (03PS1) 10Ottomata: Upgrade Kafka main-codfw to 0.9 [puppet] - 10https://gerrit.wikimedia.org/r/300867 (https://phabricator.wikimedia.org/T138265) [13:55:52] paravoid: I'm getting 404's for every image on Commons, but only from a particular ISP in London [13:56:11] they're WMF 404 pages, over HTTPS [13:56:22] if I switched to my mobile's connection they work fine [13:56:49] (03CR) 10Nemo bis: [C: 031] "This change is ok, we can always have a separate patch for complete disabling if wanted." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:57:27] ISP is -----.croy.cable.virginm.net [13:58:19] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2491923 (10MarcoAurelio) [13:59:57] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2491923 (10Paladox) You carn't change your username in gerrit I doint think. It may be possible as it is ldap but not sure since in gerrit doing it through there test signup warns you that when you set... [14:00:10] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2491940 (10Joe) [14:01:09] (03PS1) 10Giuseppe Lavagetto: puppetmaster: use puppet 3.8 on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300870 (https://bugzilla.wikimedia.org/141242) [14:01:51] <_joe_> akosiaris: ^^ [14:02:38] (03PS2) 10Giuseppe Lavagetto: puppetmaster: use puppet 3.8 on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300870 (https://bugzilla.wikimedia.org/141242) [14:03:17] bugzilla? [14:03:51] elukey if you forget to put T in it, it will turn it into buzilla [14:04:10] ahhhh thanks paladox :) [14:04:17] Your welcome :) [14:05:02] (03PS1) 10Elukey: Create user bcohn for Brent Cohn (Brentjoseph on phab) [puppet] - 10https://gerrit.wikimedia.org/r/300872 (https://phabricator.wikimedia.org/T140449) [14:07:06] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: use puppet 3.8 on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300870 (https://bugzilla.wikimedia.org/141242) (owner: 10Giuseppe Lavagetto) [14:10:58] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2491985 (10akosiaris) >>! 
In T140886#2480124, @Gehel wrote: > I tried running `checkPad.js` and it runs without error (or without any output ac... [14:11:01] 06Operations, 06Labs: Create an NFS mount manager - https://phabricator.wikimedia.org/T140483#2491986 (10chasemp) a:03chasemp [14:11:19] 06Operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2491990 (10BBlack) What's out evaluation plan here? Do we want to stall on proper IPv6 for in our VCL geoip lookup service first and do comparisons on that data? Or do some kin... [14:12:07] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2491998 (10MarcoAurelio) >>! In T141261#2491938, @Paladox wrote: > You carn't change your username in gerrit I doint think. > > It may be possible as it is ldap but not sure since in gerrit doing it th... [14:13:24] 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2492005 (10akosiaris) [14:13:26] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2492003 (10akosiaris) 05Open>03Resolved Enough days have passed, it seems there has been no issue. Resolving [14:14:44] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2491923 (10matmarex) It can be done, but it's a very manual and fairly error-prone process (several different places need to be changed at the same time) so no one likes doing it often. I had mine chang... [14:15:31] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2492019 (10matmarex) Oh, hm, or are you talking about the shell name? I was talking about the login and display name. No idea about shell. 
[14:16:01] (03PS3) 10BBlack: text VCL: refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/300560 (https://phabricator.wikimedia.org/T110717) [14:16:02] 06Operations, 06Labs, 13Patch-For-Review: access_new_install role vs. Labs vs. the future - https://phabricator.wikimedia.org/T139971#2492025 (10chasemp) p:05Triage>03Normal [14:17:56] (03CR) 10BBlack: [C: 032] text VCL: refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/300560 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:18:38] (03PS3) 10BBlack: Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) [14:18:56] (03PS1) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300873 (https://phabricator.wikimedia.org/T141242) [14:19:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300873 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [14:20:28] <_joe_> bblack: ok to submit your patch? [14:20:48] I was going to do both together, but it's not really critical [14:20:58] how did you force V+2->submit? [14:21:14] <_joe_> I just did V+2 C+2 [14:21:16] <_joe_> as usual [14:21:20] <_joe_> and then pushed submit [14:21:31] the UI doesn't even have V+2 in my view [14:21:44] <_joe_> is the patch rebased? [14:21:54] for that matter, it doesn't have C+2 or submit either, until jenkins reviews [14:22:15] <_joe_> from the reply button up top? [14:22:22] (which it still hasn't done since my last rebase. 
also, it hasn't yet shown the rebase is out of date since your forced merge above) [14:22:24] <_joe_> there is where I do it [14:22:42] ah I didn't see that, I assumed that was just for typing comments [14:23:03] now it finally says "Cannot merge", it took a while after your merge to show that [14:23:21] (03PS4) 10BBlack: Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) [14:23:38] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2492052 (10MarcoAurelio) >>! In T141261#2492019, @matmarex wrote: > Oh, hm, or are you talking about the shell name? I was talking about the login and display name. No idea about shell. Hello. Yes, I t... [14:23:52] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: split X-Wikimedia-Debug from the rest [puppet] - 10https://gerrit.wikimedia.org/r/300561 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:24:13] merged [14:24:44] <_joe_> ok [14:24:51] <_joe_> you merged my change too? [14:24:58] yeah [14:25:50] 06Operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492061 (10faidon) I'm honestly not worried all that much about tunnels anymore. In my experience, they're very rare nowadays and especially in this cross-country fashion (Googl... [14:27:54] 06Operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492065 (10BBlack) For the VCL stuff, what I meant is that for IPv6 user traffic, we could compare the runtime lookup we do for Set-Cookie on the IPv6 address to the one done via... 
[14:28:39] (03PS3) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [14:30:39] 06Operations, 10Traffic: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492069 (10BBlack) Moving forward and checking perf metrics after is an option, too. But unless the change is quite dramatic it will be hard to see it. Rolling forward and back... [14:31:44] (03PS4) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [14:32:11] PS4 is actually mine [14:32:41] I actually kind of like that change [14:33:00] is always kinda bugged me when I made a minor edit to someone else's patchset and then my name popped up in here as if it were my patch. [14:33:08] the submitter stuff is still visible in the gerrit history [14:35:39] huh, does it report the author and not the commiter now? or the owner? [14:37:38] (03PS4) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [14:38:12] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2492081 (10MarcoAurelio) To clarify, I'd like that whenever I upload new patches to gerrit, it shows that the author is MarcoAurelio, not maurelio. I don't mind SSH-ing with maurelio (which in fact I fi... [14:38:56] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2492084 (10Jgreen) That's great news re. additional available interfaces, I'll create a new #ops-eqiad to do the cable swap and su... 
[14:44:03] (03CR) 10Filippo Giunchedi: "This works, however apt fails to install python-thumbor-wikimedia I think because it depends on jessie-backports packages:" [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:47:07] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2492102 (10matmarex) In that case, I think it's your local config. Run this to view your current author name: git config --global user.name And to change it: git config --global user.name "Marco... [14:49:13] qchris: Yo :) [14:51:38] (03CR) 10Alexandros Kosiaris: [C: 031] "looks fine to me. @Ariel what do you think ?" [puppet] - 10https://gerrit.wikimedia.org/r/300837 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [14:52:12] (03CR) 10Alexandros Kosiaris: [C: 031] role::jobqueue_redis: sort redis instances [puppet] - 10https://gerrit.wikimedia.org/r/300838 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [14:53:12] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2492109 (10Joe) [14:53:21] (03PS1) 10ArielGlenn: move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) [14:53:51] (03CR) 10Chad: "I wasn't aware that there was a debian package. Will have to test, but I don't see why not?" (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [14:54:20] <_joe_> apergos: can you look at https://gerrit.wikimedia.org/r/300837 too? 
[14:54:25] (03CR) 10jenkins-bot: [V: 04-1] move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [14:54:29] _joe_: sure [14:56:31] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: puppet fail [14:57:20] (03PS2) 10ArielGlenn: move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) [14:57:28] (03PS1) 10Ottomata: Confluent MirrorMaker puppetization [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) [14:58:40] (03CR) 10jenkins-bot: [V: 04-1] Confluent MirrorMaker puppetization [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [14:59:06] Heya ostriches. Looks like the migration went basically fine. Congrats! [14:59:16] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC is happy: https://puppet-compiler.wmflabs.org/3457/" [puppet] - 10https://gerrit.wikimedia.org/r/300839 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [14:59:20] qchris: I turned the its plugin to true in All-Projects. [14:59:24] Was it really that simple? [14:59:26] * ostriches headdesks [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T1500). Please do the needful. [15:00:05] If it's working now, then yes :-D [15:00:12] Let's test :p [15:00:42] (03PS2) 10Chad: TESTING STUFF [puppet] - 10https://gerrit.wikimedia.org/r/300815 [15:00:47] Seems to work again. 
[15:00:50] https://phabricator.wikimedia.org/T775 [15:00:52] ostriches: ^ [15:01:17] (03PS3) 10Chad: TESTING STUFF [puppet] - 10https://gerrit.wikimedia.org/r/300815 (https://phabricator.wikimedia.org/T70271) [15:01:27] Yay [15:01:34] I'm both thrilled it was so easy.... [15:01:40] And pissed that I missed something so easy.... [15:01:47] _joe_: my only question is what it would do on the current version of puppet [15:02:07] I think the default of turning the plugin off is ... challenging. [15:02:08] <_joe_> apergos: the right thing? [15:02:25] I mean ... if you install a plugin, you probably want it enabled. [15:02:27] <_joe_> apergos: this is bound to be a noop on the current version [15:02:38] (03PS1) 10MarcoAurelio: Submodule commit update (test) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 [15:02:41] if it's a noop for the current version then I'm good to go [15:02:47] But that's how the Google overlords thought it should be. [15:03:03] (03CR) 10ArielGlenn: [C: 031] snapshots: do not double-enclose wiki name in brackets [puppet] - 10https://gerrit.wikimedia.org/r/300837 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [15:04:19] (03Abandoned) 10Chad: TESTING STUFF [puppet] - 10https://gerrit.wikimedia.org/r/300815 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [15:04:58] (03PS2) 10Giuseppe Lavagetto: snapshots: do not double-enclose wiki name in brackets [puppet] - 10https://gerrit.wikimedia.org/r/300837 (https://phabricator.wikimedia.org/T141242) [15:05:00] 06Operations, 10Gerrit, 06Labs: Gerrit username change request - https://phabricator.wikimedia.org/T141261#2492127 (10MarcoAurelio) 05Open>03Resolved p:05Triage>03Low a:03MarcoAurelio Hi @matmarex -- I did what you said and it's working well: https://gerrit.wikimedia.org/r/#/c/300880 as example. Th... 
[15:05:05] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: puppet fail [15:05:53] 06Operations, 06Discovery, 06Maps, 10Monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2492132 (10BBlack) ganglia-monitor restart doesn't seem to have fixed it. I'm not even sure I remember who knows how to fix the brokers best. Maybe @Dzahn knows? [15:06:25] (03CR) 10Giuseppe Lavagetto: [C: 032] snapshots: do not double-enclose wiki name in brackets [puppet] - 10https://gerrit.wikimedia.org/r/300837 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [15:07:03] I'll do a puppet run on one of the snaps to double-check [15:07:19] ostriches, i missed that too, until i saw the docs. [15:11:52] (03PS2) 10MarcoAurelio: Bump event-schemas submodule commit to master Change-Id: Ic69f6724619f2afb0cb25be4f3862fde40b76017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 [15:11:56] (03CR) 10Giuseppe Lavagetto: [C: 032] role::jobqueue_redis: sort redis instances [puppet] - 10https://gerrit.wikimedia.org/r/300838 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [15:12:03] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: sort redis instances [puppet] - 10https://gerrit.wikimedia.org/r/300838 (https://phabricator.wikimedia.org/T141242) [15:12:16] (03CR) 10Giuseppe Lavagetto: [V: 032] role::jobqueue_redis: sort redis instances [puppet] - 10https://gerrit.wikimedia.org/r/300838 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [15:12:40] (03PS3) 10ArielGlenn: move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) [15:12:52] (03PS3) 10MarcoAurelio: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 [15:14:10] (03CR) 10Elukey: "This change produces metrics like:" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 
(owner: 10Elukey) [15:17:13] ostriches it seems gerrit is much faster at ssh, and plus zuul deffintly seems to be faster at picking up changes. [15:17:26] It is also quick to post comments. [15:17:27] :) [15:17:31] (03CR) 10Elukey: [C: 032] Refactor the JMX GC metrics definition [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [15:17:47] (03PS4) 10ArielGlenn: move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) [15:19:09] (03PS1) 10Elukey: Update the jmxtrans module to the latest commit. [puppet] - 10https://gerrit.wikimedia.org/r/300886 [15:19:17] (03CR) 10ArielGlenn: [C: 032] move cron job for central auth dump to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300878 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [15:19:32] (03PS2) 10Giuseppe Lavagetto: role::statistics: fix lookup of statistics classes [puppet] - 10https://gerrit.wikimedia.org/r/300839 (https://phabricator.wikimedia.org/T141242) [15:23:31] (03CR) 10Giuseppe Lavagetto: [C: 032] role::statistics: fix lookup of statistics classes [puppet] - 10https://gerrit.wikimedia.org/r/300839 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [15:23:56] (03CR) 10Elukey: [C: 032] Update the jmxtrans module to the latest commit. [puppet] - 10https://gerrit.wikimedia.org/r/300886 (owner: 10Elukey) [15:24:02] (03PS2) 10Elukey: Update the jmxtrans module to the latest commit. 
[puppet] - 10https://gerrit.wikimedia.org/r/300886 [15:24:27] <_joe_> kill the submodules [15:24:45] _joe_: i've recently got patches for the zookeeper submodule :p [15:25:00] <_joe_> ostriches: I do a ton of refactoring :) [15:25:04] <_joe_> ups [15:25:05] from the internet at large [15:25:06] <_joe_> kill the submodules [15:25:08] <_joe_> kill the submodules [15:25:09] haha [15:25:10] <_joe_> kill the submodules [15:25:11] <_joe_> kill the submodules [15:25:18] submodules are literally hitler. [15:25:23] mr. anti open source over here [15:25:24] :o [15:27:24] <_joe_> ottomata: uhm in fact now that you make me think about it [15:27:29] <_joe_> kill the submodules [15:28:51] !log Standardized the jmxtrans GC metric names to pick up automatically variations in settings. This introduces metric name changes in Hadoop, Zookeeper, Kafka. (https://gerrit.wikimedia.org/r/#/c/299118/) [15:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:36] (03PS2) 10BBlack: Ciphersuite upgrades for one-off sites [puppet] - 10https://gerrit.wikimedia.org/r/300071 (https://phabricator.wikimedia.org/T118181) [15:31:23] (03CR) 10BBlack: [C: 032] Ciphersuite upgrades for one-off sites [puppet] - 10https://gerrit.wikimedia.org/r/300071 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [15:31:31] haha [15:31:34] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:34:06] 06Operations, 10Gerrit, 10Mail, 13Patch-For-Review, 07Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2492207 (10demon) 05Open>03Resolved a:03demon Should be resolved. If we encounter this again please reopen. 
[15:34:34] !log T140825, T134016: Reststarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-a.eqiad.wmnet) [15:34:36] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:34:37] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:36:05] (03PS1) 10ArielGlenn: move generation of lists of good dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300889 (https://phabricator.wikimedia.org/T133694) [15:36:18] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2492229 (10dschwen) 05Resolved>03Open Uuuuaaahhhh, now I'm getting `ERROR: permission denied for relation coastlines` [15:37:03] <_joe_> uh? [15:37:09] <_joe_> mediawiki exceptions? [15:37:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:38:41] 06Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2492240 (10akosiaris) Actually that's not true. ``` # smartctl -a /dev/sg0 smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.4.0-1-amd64] (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmon... [15:38:53] _joe_: Yeah. 
ori added an alert for that :) [15:39:18] <_joe_> ostriches: turns out it's a spike of dberrors from one host [15:39:26] It's meant to trigger us into actually looking at logstash :) [15:39:27] <_joe_> I'll take a look after the ops meeting [15:39:29] !log T140825, T134016: Reststarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-b.eqiad.wmnet) [15:39:31] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:39:31] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:35] (03PS2) 10Ottomata: Confluent MirrorMaker puppetization [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) [15:39:37] (which I'm now doing hehe :)) [15:40:27] (03PS2) 10ArielGlenn: move generation of lists of good dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300889 (https://phabricator.wikimedia.org/T133694) [15:40:43] (03CR) 10jenkins-bot: [V: 04-1] Confluent MirrorMaker puppetization [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [15:40:49] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2492245 (10akosiaris) >>! In T140296#2492229, @dschwen wrote: > Uuuuaaahhhh, now I'm getting `ERROR: permission denied for relati... 
[15:42:00] (03CR) 10ArielGlenn: [C: 032] move generation of lists of good dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300889 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [15:42:15] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2492249 (10yuvipanda) p:05Normal>03High a:05MoritzMuehlenhoff>03None (moving to high since this caused a couple more outages) [15:42:27] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2492252 (10yuvipanda) Have we considered giving it more RAM? [15:43:06] !log T140825, T134016: Reststarting Cassandra to apply stream timeout, and 8MB trickle_fsync (restbase1008-c.eqiad.wmnet) [15:43:07] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:43:07] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:22] (03PS4) 10Ottomata: Confluent MirrorMaker puppetization [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) [15:46:02] 06Operations, 10Traffic, 07HTTPS: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266#2492282 (10BBlack) [15:48:52] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2492308 (10GWicke) I did locally try the alternative of starting dirty page write-back early with ``` sysctl -w vm... 
[15:50:14] 06Operations, 10Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#2492313 (10demon) [15:50:16] 06Operations: Separate host lookup from the sql shell script - https://phabricator.wikimedia.org/T141255#2492314 (10Krenair) [15:51:38] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2492321 (10MoritzMuehlenhoff) >>! In T130593#2492252, @yuvipanda wrote: > Have we considered giving it more RAM? Won't help much, only stretching the interval until it OOMs at some point ag... [15:52:45] (03PS2) 10Cmjohnson: Removing mgmt dns from cp1043/1044 decom'd t133614 [dns] - 10https://gerrit.wikimedia.org/r/300284 [15:53:44] !log T140825: Setting vm.dirty_background_bytes=24M on restbase1012.eqiad.wmnet [15:53:45] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:10] !log T140825, T134016: Reststarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-a.eqiad.wmnet) [15:54:12] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:54:12] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:10] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns from cp1043/1044 decom'd t133614 [dns] - 10https://gerrit.wikimedia.org/r/300284 (owner: 10Cmjohnson) [16:02:15] !log T140825, T134016: Restarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-b.eqiad.wmnet) [16:02:20] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:02:20] T140825: Throttle 
compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [16:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:49] (03PS1) 10BBlack: sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/300893 [16:05:24] (03CR) 10BBlack: [C: 032 V: 032] sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/300893 (owner: 10BBlack) [16:06:18] 06Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2492377 (10akosiaris) So the controller is a `Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)` it is supported by the mpt3sas on 4.4 kernels (which s... [16:06:55] !log T140825, T134016: Restarting Cassandra to apply stream timeout, and disable trickle_fsync (restbase1012-c.eqiad.wmnet) [16:06:57] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:06:57] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [16:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:56] !log T134016: Restarting Cassandra to apply stream timeout (restbase1013-a.eqiad.wmnet) [16:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:22] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:12:23] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: puppet fail [16:13:04] (03CR) 10Glaisher: Disabling local uploads on ms.wikipedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [16:15:17] (03PS5) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [16:16:36] !log T134016: Restarting Cassandra to apply stream timeout (restbase1013-b.eqiad.wmnet) [16:16:37] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:16:37] huh [16:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:43] I did an edit there [16:17:01] anyways <3 inline editing [16:19:49] (03PS1) 10Ottomata: Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) [16:21:05] (03CR) 10jenkins-bot: [V: 04-1] Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [16:21:24] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:22:49] (03CR) 10Yuvipanda: toollabs: collect stats on grid usage by job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300534 (https://phabricator.wikimedia.org/T140999) (owner: 10Rush) [16:23:01] 06Operations, 10DBA, 10Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2492499 (10mmodell) [16:23:30] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2492506 (10mmodell) [16:23:36] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492507 (10BBlack) Noting from last meeting about this: We've **tentatively** said we'll try to make this (implementing a robust A/B test infrastructure at the Varnish level) an... [16:25:05] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492511 (10Nuria) Second @BBlack. 
We will make this a shared goal among traffic and analytics team [16:29:37] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2492527 (10Cmjohnson) This disk was sent by HP to SF Office despite specifying the shipping address as the data center in Virginia. Robert sent to me but via usps which will be returne... [16:30:11] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310174 (10ori) What's the rationale for prioritizing it? [16:31:30] (03CR) 10Chad: [C: 031] remove ytterbium from puppet, update gerrit comment [puppet] - 10https://gerrit.wikimedia.org/r/300806 (owner: 10Dzahn) [16:31:44] (03CR) 10Chad: [C: 031] remove ytterbium from netboot,DHCP [puppet] - 10https://gerrit.wikimedia.org/r/300812 (owner: 10Dzahn) [16:34:04] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492540 (10BBlack) It's a seasonal issue that's come up every few months for the past couple of years. Every time we need to run an A/B test, we go back through the same conver... [16:34:26] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2492543 (10mmodell) [16:36:42] subbu, gwicke, I want to schedule a Labs upgrade window which will (possibly, but not necessarily) run into the Parsoid/OCG/Citoid/Mobileapps window on the 2nd. How much will it mess you up if CI and/or other Labs things are broken during that window? [16:39:43] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:45:41] (03PS1) 10Yuvipanda: tools: Fix iowait check [puppet] - 10https://gerrit.wikimedia.org/r/300901 [16:48:18] andrewbogott, so, next week, we are planning to upgrade the parsoid production cluster .. 
so, in that period, we won't be deploying anything else. But, mobrovac would have upgraded the beta cluster prior to that, so, it won't affect Parsoid. [16:48:33] upgrade to jessie & node v4 [16:49:04] subbu: is that something that will block on CI tests? [16:49:37] no. [16:50:09] we've already tested parsoid against node v4 and are satisfied with it. [16:50:58] (03PS1) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [16:51:04] subbu: great! I will try not to overlap you but won't worry too much if I do. [16:51:04] thanks [16:51:09] k [16:51:54] (03CR) 10jenkins-bot: [V: 04-1] labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [16:51:57] i cannot speak for mobileapps however. i don't think cscott has anything critical for ocg that needs to go out and couldn't wait. [16:52:00] andrewbogott, ^ [16:52:21] * andrewbogott waits for cscott to chime in [16:52:31] * cscott pops head up [16:52:49] (03PS2) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [16:53:00] no, nothing critical for ocg. [16:53:43] (03CR) 10jenkins-bot: [V: 04-1] labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [16:55:22] cscott: thanks! 
[16:56:15] (03PS3) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [16:57:19] (03CR) 10jenkins-bot: [V: 04-1] labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [16:57:47] (03PS1) 10ArielGlenn: move wikidata json, ttl dumps cron job to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300903 (https://phabricator.wikimedia.org/T133694) [16:58:32] (03CR) 10jenkins-bot: [V: 04-1] move wikidata json, ttl dumps cron job to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300903 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [16:59:15] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [16:59:24] (03PS3) 10Alexandros Kosiaris: package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [16:59:29] (03CR) 10Alexandros Kosiaris: [V: 032] package_builder: do not override BUILDRESULT [puppet] - 10https://gerrit.wikimedia.org/r/300830 (https://phabricator.wikimedia.org/T141246) (owner: 10Hashar) [16:59:37] (03PS4) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T1700). 
[17:00:43] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (10yuvipanda) [17:01:08] (03PS2) 10ArielGlenn: move wikidata json, ttl dumps cron job to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300903 (https://phabricator.wikimedia.org/T133694) [17:02:07] (03CR) 10Andrew Bogott: [C: 031] "This makes me sad but will surely help." [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [17:03:40] andrewbogott: should be fine for mobileapps as well. /cc bearND|afk [17:04:28] ori: FYI http://deployment.wikimedia.beta.wmflabs.org/wiki/Thanks_To_Ori ;) [17:04:50] Luke081515: aww awesome, thank you! so glad it got out [17:05:15] (03PS1) 10Chad: Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 [17:05:23] andrewbogott: mdholloway I'm not worried about labs at that time. For CI going to be down on 8/2 we should also let niedzielski know. How long is it going to be down? [17:05:55] bearND|afk: It's hard to know for sure, but I'm scheduling a 3-hour window [17:05:57] (03CR) 10Chad: "Also, if we have old backups from ytterbium, they can all be purged." [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [17:07:08] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2492793 (10Cmjohnson) HP is sending a new disk and we will need to return the other disk. [17:07:28] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2492794 (10Cmjohnson) Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request... 
[17:07:39] bearND mdholloway andrewbogott: i'm missing context but i'm sure we'll be ok :) [17:08:07] I just now sent an email about this to ops@ so everyone will know twice [17:08:14] (03CR) 10ArielGlenn: [C: 032] move wikidata json, ttl dumps cron job to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300903 (https://phabricator.wikimedia.org/T133694) (owner: 10ArielGlenn) [17:08:26] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2492797 (10Cmjohnson) disks ordered for ms-be1027 [17:10:45] Another SQL query error on Special:Undelete [17:11:26] (03CR) 10Ori.livneh: [C: 031] site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [17:11:28] Viewing a deleted edit from July 25 (today) [17:11:42] (03CR) 10Ori.livneh: [C: 031] puppetmaster: generate prometheus targets from ganglia [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [17:11:51] Same db also... [17:12:10] https://phabricator.wikimedia.org/T140650 [17:18:14] all the "Merge Conflict" status in ops/puppet Gerrit :) i know they make sense because of "Fast Forward Only" and need for rebase.. but some kind of psychological effect to have a whole wall of "conflict" [17:18:53] jynus: You here? [17:19:53] (03PS3) 10Dzahn: remove ytterbium from puppet, update gerrit comment [puppet] - 10https://gerrit.wikimedia.org/r/300806 [17:20:09] (03CR) 10Dzahn: [C: 032] remove ytterbium from puppet, update gerrit comment [puppet] - 10https://gerrit.wikimedia.org/r/300806 (owner: 10Dzahn) [17:22:00] would still be nice if grrrit-wm said something when it actually gets merged/submitted [17:23:08] cant tell the difference between actually merged and waiting for verified [17:23:08] It...does sometimes.... 
[17:23:15] Behavior is weird though [17:23:20] that little pop-up is nice that is telling me when jenkins-bot is done [17:23:59] Yep i like that too [17:24:46] (03PS2) 10Dzahn: remove ytterbium from netboot,DHCP [puppet] - 10https://gerrit.wikimedia.org/r/300812 [17:25:17] apparently when we decom servers there is a missing step [17:25:22] "remove from servermon" [17:25:58] got asked about servers that are gone from everything but still there. but i guess it's the racktables->servermon import [17:26:14] :) [17:28:43] (03CR) 10Dzahn: [C: 032] remove ytterbium from netboot,DHCP [puppet] - 10https://gerrit.wikimedia.org/r/300812 (owner: 10Dzahn) [17:29:32] it was such a habit to hit Ctrl + R to reload a page.. now Ctrl + R selects a file checkbox, heh [17:30:38] PROBLEM - swift-object-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:30:39] PROBLEM - swift-container-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:30:39] PROBLEM - swift-account-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:30:49] PROBLEM - swift-object-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:31:07] PROBLEM - swift-object-server on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:31:07] PROBLEM - swift-object-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:31:28] PROBLEM - swift-object-updater on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:31:37] PROBLEM - swift-container-auditor on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-container-auditor [17:31:37] PROBLEM - swift-account-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:31:38] PROBLEM - swift-container-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:31:38] PROBLEM - swift-account-reaper on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:31:48] PROBLEM - swift-account-reaper on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:31:48] PROBLEM - swift-container-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:31:56] o_O [17:32:00] godog, ^ [17:32:11] (03CR) 10Muehlenhoff: [C: 04-1] "That would restart both endpoints at the same point, which should be avoided. Rather use fqdn_rand()" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [17:32:12] doh, sorry about that, expired downtime [17:32:21] but no panic, not in service [17:32:33] PROBLEM - swift-object-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:32:36] (03PS1) 10ArielGlenn: remove obsolete wikiqueries references from snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/300910 [17:32:41] icinga-wm: shush! [17:32:49] (03CR) 10MarcoAurelio: "> Patch Set 5: Published edit on patch set 4." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [17:33:11] PROBLEM - swift-object-updater on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:33:51] PROBLEM - swift-account-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:34:22] PROBLEM - swift-account-auditor on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:34:22] PROBLEM - swift-container-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:34:41] PROBLEM - swift-account-replicator on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:34:42] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:34:42] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:34:42] PROBLEM - swift-account-server on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:34:52] PROBLEM - swift-account-server on ms-be3002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:34:52] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:35:12] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:35:12] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:35:12] PROBLEM - swift-container-server on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:35:22] PROBLEM - swift-container-updater on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:35:32] PROBLEM - swift-object-auditor on ms-be3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:36:02] {{done}} [17:36:12] godog, is swift in esams being used? [17:37:45] (03Restored) 10Addshore: Don't log dewiki_diffstats to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: 10Addshore) [17:38:06] Krenair: not ATM no, it'll be though to e.g. test upgrades and new features and so on [17:38:55] (03CR) 10ArielGlenn: [C: 032] remove obsolete wikiqueries references from snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/300910 (owner: 10ArielGlenn) [17:39:33] (03PS4) 10Addshore: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) [17:39:56] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2492931 (10kaldari) [17:40:58] (03PS3) 10Yuvipanda: Add golang images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/300800 [17:41:04] (03CR) 10jenkins-bot: [V: 04-1] Add golang images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/300800 (owner: 10Yuvipanda) [17:41:40] (03PS4) 10Yuvipanda: Add golang images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/300800 [17:42:08] (03PS5) 10Addshore: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) [17:43:52] (03CR) 10Yuvipanda: [C: 032] Add golang images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/300800 (owner: 10Yuvipanda) [17:45:38] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492957 (10MoritzMuehlenhoff) There is no failover in that sense, various LDAP clients allow to use multiple servers and depending on their configuration they may use round-ro... [17:46:00] (03Merged) 10jenkins-bot: Add golang images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/300800 (owner: 10Yuvipanda) [17:47:17] (03PS5) 10Dzahn: labs: restart slapd once a week [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) [17:48:56] (03PS1) 10Krinkle: graphite: Set xFilesFactor to 0 for sum/count. [puppet] - 10https://gerrit.wikimedia.org/r/300911 [17:50:25] mutante to create web changes for files you go to the project settings page ie for example https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet [17:50:28] and click create change [17:50:30] (03CR) 10Dzahn: "PS5: using a random hour and minute but always "sometime on Monday"" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn) [17:50:39] then you choose the branch and then the commit msg. [17:51:27] then once that does it you get taken to an empty commit that hasent been published yet, you then click the edit button and type in the file you want to create or edit. [17:51:31] :) [17:52:16] You can also edit existing user commits through browser too so it isent limited to just creating changes. 
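(Editor's note: the web-UI steps described just above, creating a change on a branch with a commit message and then adding or editing a file in it, can be sketched as the REST requests a script might send instead. This is only a sketch, assuming Gerrit's documented change endpoints ("Create Change", "Change file content in Change Edit", "Publish Change Edit"). The function name, the `change_number` parameter and the example values are hypothetical, and nothing is sent over the network.)

```python
import json
from urllib.parse import quote


def web_ui_change_requests(project, branch, subject, change_number, file_path, content):
    """Build (method, path, body) tuples mirroring the web-UI workflow.

    Assumption: change_number is the numeric id Gerrit returns from the
    first request; it is passed in here so the sketch stays offline.
    """
    # Step 1: "create change" on the project page -> pick branch + commit message.
    create = (
        "POST",
        "/a/changes/",
        json.dumps({"project": project, "branch": branch, "subject": subject},
                   sort_keys=True),
    )
    # Step 2: click "edit" and type in the file to create or modify.
    # The file path is URL-encoded, including its slashes.
    edit = (
        "PUT",
        "/a/changes/%d/edit/%s" % (change_number, quote(file_path, safe="")),
        content,
    )
    # Step 3: publish the pending edit so it becomes a new patch set.
    publish = ("POST", "/a/changes/%d/edit:publish" % change_number, "")
    return [create, edit, publish]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for method, path, body in web_ui_change_requests(
            "operations/puppet", "production", "Fix a small typo",
            12345, "modules/example/README", "hello\n"):
        print(method, path)
```

(Whether the new change starts out as a draft depends on the server version and the request's optional status field, so that part is left out of the sketch.)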
[17:52:18] :) [17:52:23] mutante ^^ [17:52:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:53:23] paladox: ok, testing it! [17:53:29] :) [17:53:52] you have to click publish twice for it to not be a refs/drafts/ [17:53:56] mutante ^^ [17:55:38] yea, that works. i can create a change in the DNS repo by just clicking [17:55:46] Yay [17:55:47] :) [17:55:49] i get an empty change.. then i have to add files to it though [17:55:53] Yep [17:55:56] then i could edit those files [17:56:00] Yep [17:56:00] (03CR) 10Chad: [C: 031] Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [17:56:12] it works, but for me personally it's still easier to use my local editor [17:56:23] very true about mobile though [17:56:28] Oh [17:56:30] Yep [17:56:42] and tablet users like the iphone and ipad users :). [17:56:49] it's cool that you could fix a small thing without having to have git and clone at all [17:56:54] Everyone can now contribute more then they could. [17:57:00] Yep [17:57:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:57:35] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to labtest root for bd808 - https://phabricator.wikimedia.org/T140830#2477476 (10chasemp) it's my understanding this falls to me now (past 3 day window, no objections in ops meeting, labs team is onboard) and I'm in favor [17:58:03] so first it's an "edit", then you publish the edit [17:58:07] and that makes it a draft [17:58:18] yep [17:58:21] (03Draft2) 10Dzahn: decom ytterbium - use only web ui [dns] - 10https://gerrit.wikimedia.org/r/300914 [17:58:23] then you click publish [17:58:31] again which makes it a refs/changes/ [17:58:33] you publish the draft again.. 
then it's a change [17:58:45] oh i doint think we should be able to view drafts ^^ [17:58:45] so it's 2 publish steps [17:58:48] Yep [17:59:08] One is to save it as a draft then the second one is saving it as a refs/changes/ [17:59:09] change [18:00:08] (03PS6) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [18:00:10] viewing drafts is a bug, and is fixed in https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.11.9.html [18:00:16] which is in gerrit 2.12.3 [18:00:18] ostriches ^^ [18:00:24] < grrrit-wm> (Draft2) Dzahn: decom ytterbium [18:00:37] it became Draft2 when i published the Draft [18:00:38] Yep, shoulden be able to view drafts [18:00:45] but it is fixed in gerrit 2.12.3 [18:00:47] ok [18:01:22] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2493042 (10CCogdill_WMF) Confirmed with IBM that the updated key works, and we've done some validation — looks like DKIM an... [18:01:54] (03Draft2) 10Paladox: Testing [dns] - 10https://gerrit.wikimedia.org/r/300916 [18:01:58] (03Abandoned) 10Paladox: Testing [dns] - 10https://gerrit.wikimedia.org/r/300916 (owner: 10Paladox) [18:02:18] mutante oh i doint think so now, ive published it ^^ and does the same [18:02:26] wierd, maybe a bug that hasent been fixed. [18:02:35] but needs fixing in the web inline web editing [18:04:28] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2493052 (10AlexMonk-WMF) I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we can take a closer look at this... [18:06:59] (03PS2) 10Krinkle: graphite: Set xFilesFactor to 0 for sum/count. 
[puppet] - 10https://gerrit.wikimedia.org/r/300911 [18:07:00] mutante you can also do followups in the web ui [18:08:05] paladox: we have 6 different themes now [18:08:14] did you try them yet :p [18:08:16] mutante oh really what themes [18:08:17] ? [18:08:21] Heh on the diff page. [18:08:23] I tried those. [18:08:23] dunno, just saw [18:08:27] Oh [18:08:35] I am going to try those [18:08:37] eclipse, elegant, neat, midnight [18:08:41] night, twilight [18:08:42] those [18:08:47] Thanks for telling me [18:08:54] i wonder what they look like [18:08:58] and which one do you like [18:08:59] ? [18:09:01] mutante ^^ [18:10:04] haven't tried yet. i guess midnight because i usually like dark themes [18:10:25] Oh [18:10:31] I hate the twilight theme [18:10:37] too dark to see anything [18:10:39] having an issue with scrollbars on the right [18:10:43] ostriches, I assume people have already complained about this: https://bugs.chromium.org/p/gerrit/issues/detail?id=3970 [18:10:57] (traps ^T and various other firefox browser shortcuts) [18:11:43] I think someone was mentioning it. [18:11:45] ostriches and mutante that is fixed in https://gerrit-review.googlesource.com/#/c/75987 [18:11:46] mutante perhaps [18:11:50] and is fixed in gerrit 2.12.3 [18:12:00] ah in the .3 release [18:12:00] That is in the release notes. [18:12:03] Yep [18:12:10] We are on the .2 release [18:12:22] yeah I just heard that [18:12:30] that's a convincing reason to move to .3 [18:12:47] Yep, also fixes another issue hashar had. [18:13:06] lol [18:13:19] @palladium:~# puppetstoredconfigclean.rb ytterbium.wikimedia.org [18:13:20] Killing ytterbium.wikimedia.org...done. [18:13:26] There's a new gerrit? [18:13:27] w00t! [18:13:33] :) [18:13:37] yes, go visit it and see [18:13:37] Yep [18:13:37] Bsadowski1: yes [18:13:50] Much faster. 
[18:14:27] (03PS3) 10Greg Grossmeier: [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) [18:15:06] mutante you can set your edit theme too [18:15:09] LOL [18:15:22] !log ytterbium - revoke puppet cert, delete salt-key, remove from icinga [18:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:29] You also have to edit the commit to edit the commit msg now [18:15:43] mutante :) [18:16:00] Oh you can set key map to vim [18:16:30] yep, gerrit takes over the key combo that my browser uses already [18:16:34] what ariel said [18:16:38] Oh [18:16:41] (03CR) 10Rush: [V: 031] "seems right to me" [dns] - 10https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) (owner: 10Andrew Bogott) [18:17:03] You use firefox? [18:17:11] yes [18:17:28] Oh [18:17:30] no big deal, the reason to use it is gone :p [18:17:37] since there is the pop-up now [18:17:39] Oh [18:19:21] mutante ostriches it seems gerrit 2.12.2 breaks keybored shortkeys for non us keybords on the sidebyside diff [18:19:22] https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.11.8.html [18:19:30] gerrit 2.12.3 fixes it [18:19:47] and i know i use a british keybored since i brought the laptop from very.co.uk. [18:20:14] Also fixes Issue 3919: Explicitly set parent project to All-Projects when a project is created without giving the parent. 
[18:20:21] (03CR) 10Andrew Bogott: [C: 04-1] Add an lvs service ip (labs-ns.wikimedia.org) for the labs dns recursors (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284829 (https://phabricator.wikimedia.org/T119660) (owner: 10Andrew Bogott) [18:21:01] paladox: good, but that would be a separate thing [18:21:06] has US layout [18:21:15] but i doint have a us keybored [18:21:22] meaning the short cuts would be broken for me [18:21:45] even though i doint use short cuts, some other will plus i may have without realising it [18:22:17] *nod* [18:24:27] mutante i belive ctrl+r was broken see https://bugs.chromium.org/p/gerrit/issues/detail?id=3970 [18:24:34] which there is a fix and is in gerrit 2.12.3 [18:24:40] so that should hopefully fix it for you [18:24:55] mutante could you try that command on http://gerrit-test.wmflabs.org/ [18:24:57] please [18:25:02] which is running gerrit 2.12.3 [18:25:52] paladox: ctrl + r is reloading the page there as it was in the old version [18:26:05] So it works [18:26:08] on gerrit-test [18:26:13] but not on gerrit.wikimedia.org [18:26:14] ? [18:26:16] there are no check boxes there next to files [18:26:22] but dont worry about it. 
i dont need it [18:26:26] Ok [18:26:41] paladox: yes, confirmed that [18:26:56] Ok yay so at least we know the problem is fixed upstream [18:27:01] and patched in gerrit 2.12.3 :) [18:27:36] (03CR) 10Ottomata: [C: 032] Upgrade Kafka main-codfw to 0.9 [puppet] - 10https://gerrit.wikimedia.org/r/300867 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:27:43] (03PS2) 10Ottomata: Upgrade Kafka main-codfw to 0.9 [puppet] - 10https://gerrit.wikimedia.org/r/300867 (https://phabricator.wikimedia.org/T138265) [18:27:43] apergos, that's a 2.12.3 there fyi [18:28:27] :) [18:29:07] apergos you should try http://gerrit-test.wmflabs.org/ to see if it fixes the commands for firefox [18:29:09] :) [18:29:15] lemme see [18:29:19] (03Abandoned) 10Dzahn: decom ytterbium - use only web ui [dns] - 10https://gerrit.wikimedia.org/r/300914 (owner: 10Dzahn) [18:29:38] Ok thanks [18:30:47] paladox: I assume I need to create a new account over there? [18:30:52] apergos nope [18:30:56] you can use the admin [18:30:59] it has no password [18:31:00] ah [18:31:03] !log upgrading kafka to 0.9 in main-codfw, first kafka2001 then 2002 [18:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:23] :) [18:31:36] worksforme [18:31:41] :) [18:32:14] (03PS2) 10Dzahn: Create user bcohn for Brent Cohn (Brentjoseph on phab) [puppet] - 10https://gerrit.wikimedia.org/r/300872 (https://phabricator.wikimedia.org/T140449) (owner: 10Elukey) [18:32:46] (03CR) 10Dzahn: [C: 032] Create user bcohn for Brent Cohn (Brentjoseph on phab) [puppet] - 10https://gerrit.wikimedia.org/r/300872 (https://phabricator.wikimedia.org/T140449) (owner: 10Elukey) [18:33:19] gerrit 2.12.3 also fixes reindexing since if someone deletes a draft now it will cause reindexing to fail. 
[18:35:02] :) [18:36:20] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2493220 (10chasemp) @andrew and I were discussing whether LVS would make sense in front of LDAP with the ability to more intelligently depool/handle complex failure cases. [18:37:40] (03PS2) 10Dzahn: Add Bryan to labtest roots. [puppet] - 10https://gerrit.wikimedia.org/r/299959 (https://phabricator.wikimedia.org/T140830) (owner: 10Gehel) [18:38:01] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2493224 (10madhuvishy) @MoritzMuehlenhoff Done! http://keys.gnupg.net/pks/lookup?op=get&search=0xA4D1DAC73B947C4D [18:38:17] (03PS2) 10Chad: Contint/Gerrit: Fix up ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300810 [18:38:19] (03PS1) 10Chad: Contint: Revoke old gerrit ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300919 [18:38:49] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2493225 (10Eevans) The instances on 1008 were all restarted with a `trickle_fsync` value of 8M starting from ~15:30... [18:38:58] (03CR) 10Dzahn: [C: 032] Add Bryan to labtest roots. [puppet] - 10https://gerrit.wikimedia.org/r/299959 (https://phabricator.wikimedia.org/T140830) (owner: 10Gehel) [18:39:32] (03CR) 10Paladox: [C: 031] Contint/Gerrit: Fix up ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300810 (owner: 10Chad) [18:40:07] (03PS3) 10Krinkle: graphite: Set xFilesFactor to 0 for sum/count. 
[puppet] - 10https://gerrit.wikimedia.org/r/300911 [18:40:23] (03PS2) 10Ottomata: Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) [18:40:52] (03CR) 10Paladox: [C: 031] Contint: Revoke old gerrit ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300919 (owner: 10Chad) [18:41:42] (03CR) 10jenkins-bot: [V: 04-1] Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:41:48] (03CR) 10Ottomata: [C: 032 V: 032] Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:43:08] (03PS1) 10Yuvipanda: Add jdk8 webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300920 [18:43:10] (03PS1) 10Yuvipanda: Add python3 webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300921 [18:43:12] (03PS1) 10Yuvipanda: Add golang webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300922 [18:43:14] (03PS1) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300923 [18:43:36] bd808 ^ minor patches? [18:43:42] (03PS1) 10Eevans: (Re)enable Cassanrda instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) [18:43:57] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to labtest root for bd808 - https://phabricator.wikimedia.org/T140830#2493262 (10Dzahn) 05Open>03Resolved on labtestneutron2001, as a random labtest-* machine. Notice: /Stage[main]/Admin/Admin::Hashuser[bd808]/Admin::User[bd808]/... 
[18:44:03] ostriches i can load diffs instantly now whereas before the size of the file decided how fast it would load [18:44:04] :) [18:44:09] (03PS3) 10Ottomata: Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) [18:44:16] YuviPanda: you're going to force me to learn the new gerrit ui huh :) [18:44:27] Im loading integration/config layout.yaml [18:44:27] 06Operations, 10Ops-Access-Requests: Requesting access to labtest root for bd808 - https://phabricator.wikimedia.org/T140830#2493265 (10Dzahn) [18:44:29] bd808 :D [18:44:29] bd808: join the club. [18:44:31] this merge can be yours for as little as 10 clicks [18:44:31] everyone's doing it ;-) [18:44:42] bd808 you would have to use the new web ui for commit msg. [18:44:50] chasemp: Who clicks? Keyboard shortcuts ftw :p [18:45:34] mutante: are you around? interested in another round of cassandra instance wack-a-mole? :) [18:45:52] bd808: try maing a new change from scratch without using git at all, just web ui, for the heck of it [18:45:55] urandom: yes [18:46:01] LOL [18:46:29] Actually it is pretty easy to create your own change [18:46:32] from web ui [18:46:55] all you go is to the project setting page ie https://gerrit.wikimedia.org/r/#/admin/projects/integration/config and click create change [18:47:04] then you enter the branch and then the commit msg [18:47:37] ostriches: I really need to start [18:47:39] (03CR) 10Ottomata: [C: 032 V: 032] Finalize main-codfw Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:47:57] (03CR) 10BryanDavis: [C: 032] Add jdk8 webservices (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300920 (owner: 10Yuvipanda) [18:48:20] then you go to an empty change, then you click edit and then you add files, edit or remove them [18:48:24] (03Merged) 10jenkins-bot: Add jdk8 webservices 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300920 (owner: 10Yuvipanda) [18:48:26] (03CR) 10BryanDavis: [C: 032] Add python3 webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300921 (owner: 10Yuvipanda) [18:48:39] you can also edit prevous edits of other users :) [18:48:47] (03CR) 10BryanDavis: [C: 032] Add golang webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300922 (owner: 10Yuvipanda) [18:49:03] (03Merged) 10jenkins-bot: Add python3 webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300921 (owner: 10Yuvipanda) [18:49:45] (03Merged) 10jenkins-bot: Add golang webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300922 (owner: 10Yuvipanda) [18:49:48] (03CR) 10Dzahn: "this was actually -1 by jenkins-bot" [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:49:51] (03CR) 10BryanDavis: Bump debian version (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300923 (owner: 10Yuvipanda) [18:50:42] (03PS2) 10Eevans: (Re)enable Cassanrda instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) [18:51:00] (03PS1) 10Ottomata: Revert unintended change to jmxtrans module version [puppet] - 10https://gerrit.wikimedia.org/r/300925 [18:51:27] (03PS2) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300923 [18:51:32] (03CR) 10Dzahn: "please don't override jenkins-bot" [puppet] - 10https://gerrit.wikimedia.org/r/300896 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [18:51:35] (03CR) 10Ottomata: [C: 032 V: 032] Revert unintended change to jmxtrans module version [puppet] - 10https://gerrit.wikimedia.org/r/300925 (owner: 10Ottomata) [18:51:37] damn bd808, catching my laziness :) [18:51:40] bd808 fixed [18:51:44] (03PS2) 10Ottomata: Revert unintended change to 
jmxtrans module version [puppet] - 10https://gerrit.wikimedia.org/r/300925 [18:51:48] (03CR) 10Ottomata: [V: 032] Revert unintended change to jmxtrans module version [puppet] - 10https://gerrit.wikimedia.org/r/300925 (owner: 10Ottomata) [18:52:08] (03PS3) 10Dzahn: Contint/Gerrit: Fix up ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300810 (owner: 10Chad) [18:52:36] (03CR) 10BryanDavis: [C: 032] Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300923 (owner: 10Yuvipanda) [18:52:53] (03CR) 10Dzahn: [C: 032] Contint/Gerrit: Fix up ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300810 (owner: 10Chad) [18:53:19] (03Merged) 10jenkins-bot: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/300923 (owner: 10Yuvipanda) [18:57:52] (03PS3) 10Dzahn: (Re)enable Cassanrda instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [18:57:55] 06Operations, 06Release-Engineering-Team, 15User-greg: Institute a weekly review of all UBN! tasks - https://phabricator.wikimedia.org/T141130#2493314 (10greg) I didn't do this at the time I said I would but for today: ```lang=irc 18:39 <+ greg-g> ok, looking at the UBN!s now: https://phabricator.wikimed... [18:58:53] (03CR) 10Dzahn: [C: 032] (Re)enable Cassanrda instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:00:11] (03PS4) 10Eevans: (Re)enable Cassandra instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) [19:00:50] 06Operations, 06Release-Engineering-Team, 15User-greg: Institute a weekly review of all UBN! 
tasks - https://phabricator.wikimedia.org/T141130#2493327 (10greg) [19:01:17] (03CR) 10Dzahn: [V: 032] (Re)enable Cassandra instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:01:29] (03CR) 10Dzahn: (Re)enable Cassandra instance 1013-c [puppet] - 10https://gerrit.wikimedia.org/r/300924 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:03:01] urandom: it's enabled now [19:03:08] mutante: thank you! [19:03:39] was testing myself with the gerrit ui and verify [19:05:49] (03PS2) 10Dzahn: Contint: Revoke old gerrit ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/300919 (owner: 10Chad) [19:21:19] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: add all spare network switches to hardware spares tracking - https://phabricator.wikimedia.org/T139775#2493428 (10Cmjohnson) p:05Triage>03Low [19:21:25] !log T134016: Bootstrapping restbase1013-c.eqiad.wmnet [19:21:26] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:54] (03PS1) 10Chad: Gerrit: Clean up rsync migration stuff [puppet] - 10https://gerrit.wikimedia.org/r/300929 [19:21:58] (03PS1) 10Chad: WIP: Gerrit: Remove all the junk to support 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/300930 [19:22:00] (03PS1) 10Chad: Gerrit: Remove bugzilla password, unused since 4eva [puppet] - 10https://gerrit.wikimedia.org/r/300931 [19:22:02] (03PS1) 10Chad: Gerrit: Remove old library linking [puppet] - 10https://gerrit.wikimedia.org/r/300932 [19:22:04] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2493433 (10Cmjohnson) @faidon or @bblack could you look at this when you get a chance. 
[19:23:01] 06Operations, 10ops-eqiad, 13Patch-For-Review: Broken memory on mw1217 - https://phabricator.wikimedia.org/T138925#2493437 (10Cmjohnson) I do not have a decom'd R420 at this time and the R410 DIMM is not a match. Typically we decom broken out of warranty servers. Would like input from @Joe [19:23:19] 06Operations, 10ops-eqiad, 13Patch-For-Review: Broken memory on mw1217 - https://phabricator.wikimedia.org/T138925#2493440 (10Cmjohnson) p:05Triage>03High [19:23:37] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2493442 (10Cmjohnson) p:05Triage>03Low THey [19:26:20] (03CR) 10Dzahn: [C: 032] "yes, these IPs have been removed" [puppet] - 10https://gerrit.wikimedia.org/r/300919 (owner: 10Chad) [19:30:39] (03CR) 10Dzahn: "yes, manually stop rsyncd, delete config files so it can't come back. then just remove the class. what i did before in this case" [puppet] - 10https://gerrit.wikimedia.org/r/300929 (owner: 10Chad) [19:31:26] (03PS2) 10Dzahn: Gerrit: Clean up rsync migration stuff [puppet] - 10https://gerrit.wikimedia.org/r/300929 (owner: 10Chad) [19:32:24] (03CR) 10Dzahn: [C: 032] Gerrit: Clean up rsync migration stuff [puppet] - 10https://gerrit.wikimedia.org/r/300929 (owner: 10Chad) [19:33:20] mutante: Let's not merge that just yet [19:33:33] I'll amend so it just removes the class. [19:33:50] ostriches: it's actually good this way :p [19:33:59] Well, yes and no. [19:34:05] still removing the cron [19:34:16] and ferm [19:34:20] cron's on ytterbium [19:34:28] but it's true about rsyncd [19:34:39] Ok we can land it [19:34:43] Then I'll follow-up [19:34:51] Otherwise I'll be fighting puppet to remove it [19:34:51] ok [19:35:50] so it's not about removing a package, but deleting the config files [19:35:59] since rsync client is normal to have [19:36:11] submitted [19:36:29] Errrr. [19:36:31] No bueno [19:36:35] Invalid parameter ensure. [19:36:40] I was building for child. 
[19:37:10] rsync::server::module doesn't take an ensure.
[19:37:21] it, yea. eh
[19:37:30] well, we wanted the other part of that
[19:37:48] Yeah
[19:37:49] just use the follow-up to remove it
[19:37:50] One sec
[19:38:00] I'm just gonna add an ensure, seems useful ;-)
[19:38:11] nice!
[19:38:13] yes
[19:38:55] (PS1) Chad: Rsyncd: Allow ensure => absent on config files [puppet] - https://gerrit.wikimedia.org/r/300935
[19:39:29] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused
[19:39:56] (PS1) Yuvipanda: tools: s/kube2dynproxy.py/kube2proxy.py/ [puppet] - https://gerrit.wikimedia.org/r/300936
[19:40:18] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-26 19:40:00.
[19:40:25] (CR) jenkins-bot: [V: -1] Rsyncd: Allow ensure => absent on config files [puppet] - https://gerrit.wikimedia.org/r/300935 (owner: Chad)
[19:41:23] (PS2) Yuvipanda: tools: Fix iowait check [puppet] - https://gerrit.wikimedia.org/r/300901
[19:41:25] (PS2) Yuvipanda: tools: s/kube2dynproxy.py/kube2proxy.py/ [puppet] - https://gerrit.wikimedia.org/r/300936
[19:41:39] (CR) Yuvipanda: [C: 2 V: 2] tools: Fix iowait check [puppet] - https://gerrit.wikimedia.org/r/300901 (owner: Yuvipanda)
[19:41:52] (CR) Yuvipanda: [C: 2 V: 2] tools: s/kube2dynproxy.py/kube2proxy.py/ [puppet] - https://gerrit.wikimedia.org/r/300936 (owner: Yuvipanda)
[19:43:12] (PS2) Chad: Rsyncd: Allow ensure => absent on config files [puppet] - https://gerrit.wikimedia.org/r/300935
[19:51:59] new gerrit: username: cscott, Full Name: Cscott
[19:52:08] ^ how do i change the full name reported by new-gerrit?
[19:55:28] (PS1) Eevans: Enable Cassandra instance restbase2008-c.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/300942 (https://phabricator.wikimedia.org/T134016)
[19:57:05] cscott: pretty sure that's from LDAP
[19:58:59] Yep
[19:59:14] (it should've always been that way too)
[19:59:16] ldaplist says my ldap display name is "C. Scott Ananian"
[19:59:29] Sadly we don't use the display name.
[19:59:33] We use the cn.
[19:59:37] i have 'sn' and 'cn' set to 'Cscott'
[19:59:41] how do i change that?
[19:59:50] mutante: I have one more instance bootstrap, if you have the time
[19:59:56] mutante: last one today :)
[20:00:02] cscott: We can change the cn.
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T2000).
[20:00:33] !log restbase deploy 8efbc92 to deployment-prep
[20:00:38] ostriches: do i have to perform an arcane ritual involving "be bold" mugs in order to do so?
[20:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:00:53] ostriches: You have to file a task and assign it to me :p
[20:01:00] Is that arcane enough?
[20:01:08] "consult a wizard"
[20:02:36] how many LDAP renaming tasks are currently assigned to you ostriches? :)
[20:02:44] cscott: Fwiw, it means that you'll be using your CN to login to most things. We don't use sn's except as your actual shell name.
[20:02:55] Krenair: Um, 2 that are stalled.
[20:02:59] Others I closed out.
[20:02:59] we use uid as shell name
[20:03:12] for anyone it may concern, i am planning a mobileapps deployment but ran into some unexpected unit test errors so it'll be a bit
[20:03:12] ostriches: gerrit says my username is "cscott" (no caps) but my "real name" is "Cscott"
[20:03:20] ostriches: are those both using the cn?
[20:03:46] accountFullName = cn
[20:04:14] what else uses the cn as login?
[20:04:29] Phab
[20:04:38] every misc service using HTTP basic auth
[20:04:52] Krenair: i use MW OAuth to log into phab
[20:04:53] ldap.accountSshUserName defaults to uid.
[20:05:03] no parsoid deploy today.
[20:05:09] !log restbase deploy 8efbc92 to staging
[20:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:05:14] Jenkins
[20:05:26] cscott, okay but you can use LDAP
[20:06:05] oh, the "arcane" part is mapping ostriches -> demon so you can file the phab task
[20:06:12] :D
[20:06:15] maybe i shouldn't have disclosed the secret in publick
[20:06:16] :P
[20:06:27] NOW I NEED A NEW NICK GAWSH
[20:08:10] most of the things listed here use cn to login: https://wikitech.wikimedia.org/wiki/LDAP_Groups
[20:08:29] I'm not changing it at this point since people are used to it.
[20:08:39] :)
[20:09:23] ostriches: try https://en.wikipedia.org/wiki/Special:RandomInCategory/Animals ?
[20:09:36] Huh?
[20:09:47] for your next ircnick
[20:09:49] to pick a new name
[20:09:51] lol
[20:10:03] !log restbase deploy 8efbc92 canary deploy to restbase1007
[20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:10:11] cscott, Krenair: We could maybe set full name to displayName instead of cn. It's the default, tbh, we override it.
[20:11:08] ostriches, does that set the login name?
[20:11:16] or just the name shown to other users?
[20:11:16] Nope.
[20:11:20] ok
[20:11:20] That's based on accountBase.
[20:11:29] Er, accountPattern
[20:11:30] accountPattern = (&(objectClass=person)(cn=${username}))
[20:11:46] Operations, Release-Engineering-Team, User-greg: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2493671 (faidon)
[20:12:02] Krenair: I guess my only question is "Does everyone have a sane displayName set?"
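The question of whether every account has a usable displayName suggests a fallback rather than a straight swap. The following is a minimal sketch of that idea only, not a description of how Gerrit resolves names: the `full_name` helper, the dict-shaped entries, and the second example account are hypothetical, and no real directory lookup is performed.

```python
def full_name(entry):
    """Return displayName when it is set and non-blank, otherwise cn.

    `entry` is a plain dict standing in for an LDAP entry; this shape is
    an assumption made for illustration.
    """
    display = (entry.get("displayName") or "").strip()
    return display if display else entry["cn"]


# First entry uses the values quoted in the conversation above;
# the second is an invented account with no displayName.
with_display = {"uid": "cscott", "cn": "Cscott", "displayName": "C. Scott Ananian"}
without_display = {"uid": "example", "cn": "Example User"}

print(full_name(with_display))
print(full_name(without_display))
```

With a rule like this, the login attribute (cn, via accountPattern) stays untouched and only the name shown to other users changes.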
[20:12:42] no
[20:12:52] for example uid=matmarex does not have one set
[20:13:20] cscott: ostriches the long read is https://phabricator.wikimedia.org/T113792#1676462
[20:13:59] and https://gerrit.wikimedia.org/r/#/c/4166/ (yeah 4166)
[20:14:46] feel free to copy paste my reply to the task detail, add examples as needed and make that an RFC :]
[20:14:50] I know.
[20:15:06] more for cscott :D
[20:15:07] I'm not changing the accountPattern, logins will remain as cn.
[20:15:22] Now, if we have sane displayNames, I wouldn't mind swapping that.
[20:15:27] But sounds like it's not always the case.
[20:15:42] Operations, Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#2493688 (hashar)
[20:17:05] Operations, Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676037 (hashar)
[20:17:20] Operations, Gerrit: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#1676037 (hashar)
[20:17:58] Operations, Release-Engineering-Team, User-greg: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2493130 (faidon) I'd like us (#operations) to be involved in those discussions. We are de facto and de jure the primary incident responders as w...
[20:20:56] Operations, Gerrit: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#2493719 (hashar)
[20:21:02] done
[20:21:11] and added cscott account as an example on T113792
[20:21:12] T113792: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792
[20:23:03] Operations, Release-Engineering-Team, User-greg: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2493729 (greg) +1 :) I think this first meeting I'm having with TPG is just "is someone willing to help brainstorm" :) so, yeah, I'll loop you i...
[20:23:11] Operations, Gerrit: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit) - https://phabricator.wikimedia.org/T113792#2493731 (demon) Keep in mind, your user name in commits is set based off of your local git's `user.name` setting and has no impact on Gerrit. The only pl...
[20:23:44] hashar: Undid you, sorry.
[20:24:35] We can change cn's.
[20:25:25] And it updates everything that is possible :)
[20:25:35] (what's not is git, but no amount of ldap changes will change that)
[20:29:29] yeah
[20:29:39] for git we have the .mailmap to alias folks though
[20:30:01] the thing is that Gerrit uses our login name instead of our DisplayName
[20:30:34] so if I hit [Rebase] the commit ends up with: Committer: Hashar
[20:31:02] I think that was part of the issue of DZahn, the other would be the mailmap
[20:31:55] Does anything really use mailmap?
[20:31:58] I doubt gerrit cares.
[20:35:07] Krenair: wut
[20:36:24] the name gerrit uses, and also the name i log in with, is "Bartosz Dziewoński". the "ń" is sometimes fun, e.g. when i can't log into kibana with some browsers that apparently encode it wrong
[20:36:58] but i had it changed by ostriches from "Matmarex" or something else with silly capitalization some years ago
[20:37:25] !log restbase deploy 8efbc92
[20:37:30] (for the reference, the correct way is "matmarex" or "Matma Rex", i should fix my irc nickname some day too)
[20:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:43:48] having accents in your login name is lots of fun
[20:44:06] at some point I couldn't use Labs for half a year
[20:45:20] with Horizon it was only two weeks
[20:45:55] so yes, detaching displayed name from login name would make our infrastructure much saner
[20:45:56] Accents are sillÿ
[20:46:27] * MatmaRex stabs östrichës
[20:46:58] tgr: pfft, i consider it a great test of software excellence
[20:47:08] for example MediaWiki can take them fine. just sayin'. ;)
[20:53:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[20:54:43] <|---|> o.O
[20:55:05] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2493811 (hashar) >>! In T115194#2493052, @AlexMonk-WMF wrote: > I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we...
[20:55:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[20:58:25] mobileapps deployment is cancelled.
[20:59:36] (PS3) Gilles: Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072)
[20:59:40] MatmaRex: hear, hear. cscott = ok. CScott = well, fine, if you must. Cscott = not okay.
[21:00:04] dapatrick and bawolff: Dear anthropoid, the time has come. Please deploy Security (non-urgent) deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T2100).
[21:00:31] when I was at OLPC our standard burn-in test for a new OS image involved creating an account name with as many accents and special characters as possible, since python2 often had serious issues with non-ascii strings.
[21:00:53] in JS land we try to use surrogate characters (above UTF-16) whenever possible during testing
[21:00:59] (PS4) Gilles: Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072)
[21:01:43] ObGoogle: I wonder if I should switch my DNS hosting from gandi.net to the new google domains service?
[21:02:08] MatmaRex: I've filed a number of bugs over the years against software that broke on "^demon" because they didn't escape regex chars ;-)
[21:02:33] $demon would be more evil
[21:02:42] ^demon actually works as a regexp, usually.
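The escaping bug described above can be sketched in a few lines. This is a minimal illustration in Python, assuming two hypothetical helpers that check whether a line begins with a given nickname; `re.escape` is the standard-library call that makes characters such as `^` and `$` literal.

```python
import re


def starts_with_nick_unescaped(line, nick):
    # Buggy variant: the nick is pasted into the pattern verbatim, so a
    # leading "^" is read as a second start-of-string anchor.
    return re.search("^" + nick, line) is not None


def starts_with_nick(line, nick):
    # Safer variant: re.escape makes every character in the nick literal.
    return re.search("^" + re.escape(nick), line) is not None


line = "^demon: ping"
print(starts_with_nick_unescaped(line, "^demon"))  # False
print(starts_with_nick(line, "^demon"))            # True
```

The unescaped pattern becomes `^^demon`, which can only match text that starts with `demon`, so the nick never matches itself.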
[21:03:02] Well, it breaks when your regex ends up being /^^demon/ :p
[21:17:20] (PS1) Yuvipanda: tools: Add a role to help build tools images [puppet] - https://gerrit.wikimedia.org/r/301000
[21:17:34] ostriches i believe i know of some css fixes
[21:17:42] which im going to upload now
[21:17:53] * paladox uses the new web ui editor :)
[21:20:06] (CR) Yuvipanda: [C: 2] tools: Add a role to help build tools images [puppet] - https://gerrit.wikimedia.org/r/301000 (owner: Yuvipanda)
[21:41:41] (PS1) Andrew Bogott: Update libvirt driver (with our hack) for Liberty [puppet] - https://gerrit.wikimedia.org/r/301003 (https://phabricator.wikimedia.org/T131548)
[21:43:33] (CR) Andrew Bogott: [C: 2] Update libvirt driver (with our hack) for Liberty [puppet] - https://gerrit.wikimedia.org/r/301003 (https://phabricator.wikimedia.org/T131548) (owner: Andrew Bogott)
[21:45:50] (Draft2) Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[21:46:03] ostriches mutante ^^
[21:47:20] (PS3) Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[21:52:18] !log deployed security patch for T137551
[21:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:57:14] (PS5) 20after4: Specify home directory for phd user [puppet] - https://gerrit.wikimedia.org/r/300468
[22:11:15] (PS6) Addshore: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861)
[22:13:01] (CR) Dereckson: Add dewiki_diffstats to wmgMonologChannels (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: Addshore)
[22:13:05] (Draft2) Jforrester: De-deploy ImageMetrics [mediawiki-config] - https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952)
[22:16:34] (PS7) Addshore: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861)
[22:20:18] (CR) Dereckson: [C: 1] Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: Addshore)
[22:20:30] thanks Dereckson ! :)
[22:21:21] You're welcome.
[22:27:35] (Abandoned) EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - https://gerrit.wikimedia.org/r/295442 (owner: EBernhardson)
[22:27:41] (PS1) Yuvipanda: python: Load python and python3 plugins [software/tools-webservice] - https://gerrit.wikimedia.org/r/301014
[22:31:57] (PS1) Reedy: Run createTxtFileSymlinks.sh update tracked dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/301015
[22:32:24] (CR) Reedy: [C: 2] Run createTxtFileSymlinks.sh update tracked dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/301015 (owner: Reedy)
[22:35:16] (PS1) Reedy: Swap to using static php array for TrustedXFF usage [mediawiki-config] - https://gerrit.wikimedia.org/r/301016 (https://phabricator.wikimedia.org/T141120)
[22:44:41] (CR) Reedy: [V: 2] Run createTxtFileSymlinks.sh update tracked dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/301015 (owner: Reedy)
[22:46:02] !log reedy@tin Synchronized docroot/noc/conf/: Update dblist symlinks (duration: 00m 37s)
[22:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:48:15] !log restarted zuul due to depends-on lockup
[22:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:49:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:53:11] (PS2) Reedy: Swap to using static php array for TrustedXFF usage [mediawiki-config] - https://gerrit.wikimedia.org/r/301016 (https://phabricator.wikimedia.org/T141120)
[22:53:15] (PS3) Reedy: Swap to using static php array for TrustedXFF usage [mediawiki-config] - https://gerrit.wikimedia.org/r/301016 (https://phabricator.wikimedia.org/T141120)
[22:53:41] (CR) Reedy: [C: 1] "Needs to wait till relevant changes are deployed to Wikimedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/301016 (https://phabricator.wikimedia.org/T141120) (owner: Reedy)
[22:53:54] I keep getting 502 bad gateway errors on en.wikipeida.org.
[22:54:06] Intermittently.
[22:54:11] Example URL: https://en.wikipedia.org/wiki/MediaWiki:Robots.txt
[22:54:44] seems fine from europe
[22:54:53] Debra you going to the wrong link, its en.wikipedia.org
[22:55:27] ditto, I just got two 502 messages in a row (both on officewiki). Cannot reproduce though.
[22:55:38] paladox: Thx, but I'm hitting the correct URL.
[22:55:45] Oh
[22:55:46] That was a typo in chat.
[22:55:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:55:49] Oh
[22:56:01] Im getting it correctly in europe too.
[22:56:30] Reedy: Do we have a graph of 502s somewhere?
[22:57:05] I thought we did
[22:57:21] I think we do
[22:57:34] where's gdash gone?
[22:57:41] ie public graphite stuff
[22:57:48] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes ?
[22:57:53] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json ?
[22:58:05] ahh Debra yeh that looks like it
[22:58:21] I don't see an obvious spike, but I got two 502s in close proximity. :-(
[22:59:03] there were a 'bunch' of redis errors a little while ago it would seem (spike)
[22:59:21] but probably not related
[23:00:05] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160725T2300).
[23:00:05] Addshore: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:12] *waves* here!
[23:00:22] I can SWAT this evening.
[23:00:41] (CR) Reedy: [C: -1] Swap to using static php array for TrustedXFF usage [mediawiki-config] - https://gerrit.wikimedia.org/r/301016 (https://phabricator.wikimedia.org/T141120) (owner: Reedy)
[23:03:43] (PS8) Dereckson: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: Addshore)
[23:03:54] (CR) Dereckson: [C: 2] Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: Addshore)
[23:04:22] (Merged) jenkins-bot: Add dewiki_diffstats to wmgMonologChannels [mediawiki-config] - https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: Addshore)
[23:06:43] addshore: live on mw1099
[23:07:15] and wikipedia still loads there, so looks like it is good to go out!
[23:08:19] Operations, Analytics, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2494260 (Nuria) @BBlack: i volunteer to write a design doc with user cases /high level design ideas and issues by the end of this quarter so we can use it to scope the work we...
[23:09:16] ok
[23:09:16] yup, and the log has appeared on fluorine :)
[23:09:21] nice
[23:10:11] Debra, quiddity: I don't see anything in fatal.log for "Robots" or "office". Was it just the plain white and black nginx 502 bad gateway error?
[23:10:23] yup
[23:10:56] Yes.
[23:10:58] legoktm: Are we replacing all $wmg to $wg when we convert to load extension?
[23:11:14] Reedy: yes
[23:11:15] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add dewiki_diffstats to wmgMonologChannels ([[Gerrit:288158]], T134861) (duration: 00m 25s)
[23:11:16] T134861: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861
[23:11:16] legoktm: Might've happened at the Varnish level?
[23:11:18] that sounds like a varnish/traffic issue then
[23:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:11:19] not MW
[23:11:26] Reedy: yes, except for $wmgUseExtensionName
[23:11:34] Right, I was pointing at MediaWiki, I was pointing at operations. ;-)
[23:11:42] Err, wasn't *
[23:11:58] someone from ops probably needs to look then
[23:12:12] Cool. Do they have an IRC channel?
[23:12:32] addshore: live in prod
[23:12:52] Dereckson: many thanks! :)
[23:13:00] You're welcome, thanks for testing.
[23:13:31] (PS4) Krinkle: graphite: Set xFilesFactor to 0 for sum/count. [puppet] - https://gerrit.wikimedia.org/r/300911
[23:13:50] (PS5) Reedy: Load RestBaseUpdateJobs via wfLoadExtension() [mediawiki-config] - https://gerrit.wikimedia.org/r/298095 (https://phabricator.wikimedia.org/T139800)
[23:15:06] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.106 second response time
[23:17:29] Reedy: with one change, in what order do you offer to deploy that?
[23:17:47] eh?
[23:17:49] IS, CS: we lost after IS the expected wmg
[23:18:02] CS, IS: we don't have wgRe... defined
[23:18:03] sync-dir wmf-config PROFIT
[23:19:03] hmmmm last time we did a full scap with the same rationale of sync two files simultaneously, we served 70 fatal errors.
[23:20:59] th.cipriani recommends two changes so we avoid any sync issue if rsync lags
[23:21:15] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.122 second response time
[23:23:20] https://lists.wikimedia.org/pipermail/wikitech-l/2016-July/086142.html ?
[23:25:18] Dereckson: I'd do it without 2 separate commits then
[23:26:24] copying manually the array in IS, sync, removing the array, sync?
[23:27:45] wfLoadExtension
[23:33:49] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[23:33:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[23:35:29] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[23:35:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5084918 keys - replication_delay is 0
[23:55:08] (PS1) Paladox: gerrit: Fix the css for inline diff [puppet] - https://gerrit.wikimedia.org/r/301027