[00:00:04] twentyafterfour and mutante: Dear anthropoid, the time has come. Please deploy Phabricator migration iridium -> phab1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170804T0000). [00:00:53] here [00:01:08] steps: https://phabricator.wikimedia.org/T163938#3499331 [00:01:22] mutante: which part should I do? [00:01:38] * twentyafterfour stops phd [00:01:43] * twentyafterfour silences icinga first [00:01:44] (03CR) 10Reedy: [C: 031] Change wikipedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/370040 (https://phabricator.wikimedia.org/T170891) (owner: 10Herron) [00:01:47] twentyafterfour: stop the existing thing and log that it starts maintenance [00:01:51] i will do rsync [00:02:15] !log Taking phabricator down for maintenance / migration to a new server: phab1001 [00:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:32] copies the steps OUT of phab itself, heh [00:03:14] let me know when we can be sure the repos are not writable [00:03:53] !log phab1001 - /usr/bin/rsync -av rsync://iridium.eqiad.wmnet/srv-repos /srv/repos/ [00:03:57] mutante: ok phd and apache are stopped [00:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:05] we will need to disable ssh too? [00:04:26] yea, but just git-ssh [00:04:45] mutante: can we just block the sshd port or remove the IP? otherwise it will still serve ssh requests even with apache down [00:05:06] cant we stop the process that runs the git-ssh [00:05:23] * twentyafterfour is looking for the service unit [00:05:41] oh, and let me disable puppet now unless you have already [00:05:49] systemctl stop ssh-phab ? 
[00:06:02] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2309 bytes in 0.011 second response time [00:06:17] heh, that is success ?:) [00:06:21] because phab2001 (sic) tells us [00:06:42] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2309 bytes in 0.011 second response time [00:06:52] mutante: I got the ssh stopped [00:07:20] !log stopped ssh-phab on iridium [00:07:24] schedules downtime for all of iridium [00:07:28] !log stoped phd and apache on iridium [00:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:49] twentyafterfour: cool, rsycing one more time [00:07:59] cool [00:08:09] done [00:08:25] no we need to change the domain defined on phab1001 and run puppet, right [00:08:26] (03PS3) 10Dzahn: exim/phabricator: send mail to phab1001, not iridium [puppet] - 10https://gerrit.wikimedia.org/r/369834 (https://phabricator.wikimedia.org/T163938) [00:08:34] and then the director [00:08:43] now i will redirect the email [00:08:55] (03CR) 10Paladox: [C: 031] exim/phabricator: send mail to phab1001, not iridium [puppet] - 10https://gerrit.wikimedia.org/r/369834 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:09:12] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! 
[00:09:13] PROBLEM - PyBal IPVS diff check on lvs1002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) [00:09:13] (03CR) 10Dzahn: [C: 032] exim/phabricator: send mail to phab1001, not iridium [puppet] - 10https://gerrit.wikimedia.org/r/369834 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:09:35] what.. no [00:09:42] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! [00:09:50] that's not good [00:10:06] and we didn't even switch it over yet [00:10:22] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! [00:10:38] hmm [00:10:52] PROBLEM - PyBal IPVS diff check on lvs1005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) [00:11:41] twentyafterfour: i dont know enough about the pybal part to debug in realtime what is missing [00:11:52] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2] [00:12:06] well then what shall we do? [00:12:14] the git-ssh port isn't very critical [00:12:18] barely anyone uses it [00:12:47] ok, we can switch the IP and try it [00:12:53] as planned..or go back [00:12:57] twentyafterfour: Joe (my team's PM) spent all day writing up three giant tasks, and when he submitted them he got phab errors due to the maintenance. I know phab is pretty good about storing in-progress *comments*, but does it do anything for tasks? Is there any way he can recover his work? 
[00:13:17] oh no [00:13:38] RoanKattouw: if he hasn't closed the tab then maybe resubmit after we bring it back up [00:13:53] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) [00:14:09] He tried the back button on one of them, at my suggestion [00:14:34] (03PS2) 10Dzahn: phabricator: set phab1001 to active phab server [puppet] - 10https://gerrit.wikimedia.org/r/369836 (https://phabricator.wikimedia.org/T163938) [00:14:46] it would be nice if phabricator saved in-progress tasks like it does with comments... [00:14:50] but sadly, no [00:14:59] RoanKattouw: step 1: don't panic, keep the browser open and let's hope for the best when we get it back up. Also, apologies :( [00:15:01] saved to database? [00:15:03] (03CR) 10Dzahn: [V: 032 C: 032] phabricator: set phab1001 to active phab server [puppet] - 10https://gerrit.wikimedia.org/r/369836 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:15:13] MaxSem: does with comments, yeah [00:15:19] (03PS4) 10Dzahn: cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) [00:15:27] well, if the host is down... [00:15:45] it saves periodically so it would have a previous version [00:15:49] For comments, yes [00:15:51] But for new tasks? 
[00:15:55] not for tasks [00:16:03] it _should_ do that though [00:16:36] as in, should have a new feature :) [00:16:38] (03CR) 10Dzahn: [C: 032] cache::misc/phabricator: switch from iridium to phab1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:16:43] (03CR) 10Paladox: cache::misc/phabricator: switch from iridium to phab1001 backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:17:32] (03CR) 10Dzahn: cache::misc/phabricator: switch from iridium to phab1001 backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369820 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:17:54] paladox: https://gerrit.wikimedia.org/r/#/c/370104/ [00:18:07] ah [00:18:08] thanks [00:18:24] (03PS2) 10Dzahn: phab: phabricator-new to phab2001, phab1001 using normal domain [puppet] - 10https://gerrit.wikimedia.org/r/370104 (https://phabricator.wikimedia.org/T163938) [00:18:36] +1 [00:19:58] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! 
daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab1001-vcs.eqiad.wmnet]) daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down! daniel_zahn phab migration [00:19:58] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2291 bytes in 0.009 second response time daniel_zahn phab migration [00:19:59] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2309 bytes in 0.009 second response time daniel_zahn phab migration [00:20:16] (03CR) 10Paladox: [C: 031] phab: phabricator-new to phab2001, phab1001 using normal domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370104 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:20:21] (03CR) 10Dzahn: [C: 032] phab: phabricator-new to phab2001, phab1001 using normal domain [puppet] - 10https://gerrit.wikimedia.org/r/370104 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:21:22] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational [00:21:40] running puppet on misc-cache..merging all the planned things [00:21:47] ran puppet on mx for mail [00:21:53] on cp1045 to confirm varnish [00:22:02] now on all via cumin [00:22:24] on phab1001 to fix apache conf [00:23:03] !log iridium sudo ip addr del 10.64.32.186/32 dev eth0 [00:23:12] RECOVERY - 
https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33996 bytes in 0.490 second response time [00:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:16] !log iridium sudo ip addr del 2620:0:861:103:10:64:32:186/128 dev eth0 [00:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:30] !log iridiium sudo ip addr del 208.80.154.250/32 dev lo [00:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:43] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33997 bytes in 0.255 second response time [00:23:43] !log iridium sudo ip addr del 2620:0:861:ed1a::3:16/128 dev lo [00:23:46] twentyafterfour should apache be stopped on phab1001? [00:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:09] !log phab1001 sudo ip addr add 10.64.32.186/32 dev eth0 [00:24:18] !log phab1001 sudo ip addr add 2620:0:861:103:10:64:32:186/128 dev eth0 [00:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:53] puppet again on phab1001 -- should not take anything away [00:25:23] Notice: /Stage[main]/Phabricator::Vcs/Notify[Warning: phabricator::vcs::listen_address is empty]/message: defined 'message' as 'Warning: phabricator::vcs::listen_address is empty' [00:25:27] that's ok [00:25:32] now https://gerrit.wikimedia.org/r/#/c/370119/ [00:25:44] (03CR) 10Paladox: [C: 031] phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:26:17] resolves rebase issue :p [00:26:29] (03PS9) 10Smalyshev: logstash: Parse nginx access logs for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [00:27:10] (03CR) 1020after4: [C: 031] phabricator: switch 
service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:27:23] ! [remote rejected] HEAD -> refs/publish/production/phab1001 (internal server error: Error inserting change/patchset) [00:27:26] wut [00:27:52] * paladox saw that too [00:27:53] uh [00:27:57] operations/puppet [00:28:00] who can rebase it :p [00:28:00] and mediawiki/core [00:28:05] !log testing phd on phab1001 [00:28:06] oh [00:28:13] i can try [00:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:15] we don't have a phd user on phab1001? [00:29:42] (03PS5) 10Paladox: phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:30:17] oh we do but repos aren't owned by that user [00:30:39] ah, sudo chown -R phd:phd /srv/repos [00:30:40] (03CR) 10Dzahn: [C: 032] phabricator: switch service IPs to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370119 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [00:30:54] twentyafterfour: ooh.. fixing in a second [00:30:55] !log phab1001 chown -R phd:phd /srv/repos [00:31:05] that's because phd doesnt have same UID on both [00:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:10] usual problem with rsyncd [00:31:21] mutante: already fixing it [00:31:23] we could have changed the UID beforehand [00:31:27] ok, cool [00:31:37] sounds much better than "no user" sounded :) [00:32:03] but the rsync might break it again... [00:32:11] if the cron is enabled [00:32:12] ya'll are confusing twitter ;) http://imgur.com/a/WMvuY [00:32:16] "autosync => true" [00:32:30] did we do "true", paladox? 
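The ownership problem described above (the phd user has a different UID on each host, so the rsynced files under /srv/repos end up owned by the wrong numeric UID) can be checked for before starting the daemon. The following is only a minimal sketch under that assumption; the function name `find_wrong_owner` is hypothetical and not part of any tooling mentioned in the log.

```python
import os


def find_wrong_owner(root, expected_uid):
    """Return every path at or below root whose owner UID differs from expected_uid."""
    mismatched = []
    # Check the top-level directory itself, since os.walk only reports children.
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are judged by their own owner, not the target's.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

On the new host this could be called as `find_wrong_owner("/srv/repos", pwd.getpwnam("phd").pw_uid)` (with `import pwd`); a non-empty result would indicate that a recursive chown is still needed, or that a later sync has changed the ownership again.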
[00:32:41] um, /me checks [00:32:53] greg-g lol [00:32:55] greg-g: :)) [00:33:16] nope [00:33:17] https://gerrit.wikimedia.org/r/#/c/368841/9/modules/profile/manifests/phabricator/main.pp [00:33:19] submits the last patch on master [00:33:23] lol [00:33:52] paladox: ok, that means _NOT_ auto with cron [00:33:55] good [00:33:57] ah i see [00:34:04] so the permissions won't be rebroken [00:34:18] https://github.com/wikimedia/puppet/blob/8f958d09dd48ca7993dd61381ec44ff79853ed8d/modules/rsync/manifests/quickdatacopy.pp#L28 [00:34:22] mutante ^^ [00:34:25] it's true by default [00:34:45] and phabricator web is back online [00:34:56] !log phab1001 - service IPs switched - puppet ran - ssh-phab service up [00:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:12] !log phabricator web service is also up [00:35:19] come on, now it would be nice to see those Icinga alerts also recover ... [00:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:31] (03PS1) 10EBernhardson: Update CirrusSearch AB test rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) [00:35:34] btw, re Joe's potentially lost tasks: I think we're OK (at least for one of them: https://phabricator.wikimedia.org/T172468#3499463 ) [00:35:58] IP addresses in phab1001: [00:36:07] * greg-g goes afk [00:36:09] 10.64.16.8 (server) [00:36:17] 10.64.32.186 [00:36:25] 2620:0:861:103:10:64:32:186 [00:36:30] lol https://phabricator.wikimedia.org/ redirects to diffusion [00:36:44] wow the new machine is better hardware, huh? 32 cores? 
:D [00:36:50] not for me [00:36:57] i see Phab now [00:36:57] :) [00:36:59] yay [00:37:01] paladox: not for me either [00:37:17] chown is taking forever [00:37:21] oh, took me a while to load https://phabricator.wikimedia.org/diffusion/ seems the header links diffusion [00:37:31] twentyafterfour: yes, 31 processors indeed [00:37:34] 32 [00:37:34] ah [00:37:36] works now [00:37:50] I think the old one was 24 or 16 [00:37:59] 16 [00:38:06] systemd state still needs to recover [00:38:07] that's a nice upgrade [00:38:08] it has alot more processing power now heh :) [00:38:19] paladox: at least a lot more parallelism [00:38:25] :) [00:38:26] and i want to see the git-ssh● phd.service loaded failed failed phabricator-phd [00:38:30] unless it's just the hyperthreading [00:38:32] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:38:36] ^ that [00:39:04] we need to do sudo chown -R phd:phd /srv/repos again [00:39:17] we are on jessie now [00:39:24] using systemd [00:39:41] is that issue caused by old init script / start method? [00:39:46] oh [00:40:13] /srv/repos is owned by phd:www-data [00:40:13] mutante: maybe? 
phd doesn't do systemd [00:40:14] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: Turn CirrusSearch MLR test back off (duration: 00m 47s) [00:40:17] paladox: how do you see the permissions of /srv/repos [00:40:22] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [00:40:22] twentyafterfour: define "doesnt do" :) [00:40:23] mutante i see [00:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:29] drwxr-xr-x 4 phd www-data 4096 Jul 13 19:37 repos [00:40:37] inside there [00:40:38] mutante: well, phd manages it's own daemons [00:40:39] i see [00:40:39] drwxr-xr-x 7 phd phd 4096 Jul 13 15:37 1 [00:40:39] drwxr-xr-x 7 phd phd 4096 Jul 13 19:37 2 [00:40:52] twentyafterfour: but phd itself is a systemd unit [00:41:11] well it mimics init.d style service [00:41:15] but it works on jessie in labs [00:41:16] sudo chown phd:www-data /srv/repos [00:41:20] cd /srv/repos [00:41:24] paladox: that command is still running [00:41:29] ok [00:41:32] twentyafterfour: how was phd started [00:41:47] mutante: I stopped it, so started by puppet I guess [00:42:42] ok, stopping puppet, waiting until it's ready to be restarted [00:43:20] I don't know why it takes so long to chown /srv/repos [00:43:29] these must be spinning disks [00:43:37] * twentyafterfour is spoiled by ssds [00:43:48] i don't understand "CRITICAL: Hosts in IPVS but unknown to PyBal: set(['phab1001-vcs.eqiad.wmnet'])" [00:43:49] twentyafterfour spoiled spoiled [00:43:55] it's not like a new thing we added [00:44:03] we made no change to that name today [00:44:17] and it started when you simply stopped stuff on iridium [00:44:25] i wish that would recover now [00:44:28] mutante: yeah I don't know [00:44:52] lets see, is git-ssh all up and running on phab1001? 
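The wording of the alert ("Hosts in IPVS but unknown to PyBal: set([...])") suggests a set comparison between the servers present in the kernel's IPVS table and the servers PyBal believes it manages. The sketch below only illustrates that idea; it is not the actual Icinga check, and the function name and its inputs are assumptions.

```python
def ipvs_pybal_diff(ipvs_hosts, pybal_hosts):
    """Compare two host lists and report what each side has that the other lacks."""
    ipvs = set(ipvs_hosts)
    pybal = set(pybal_hosts)
    return {
        # Hosts still in the IPVS table that PyBal no longer tracks.
        "in_ipvs_only": sorted(ipvs - pybal),
        # Hosts PyBal tracks that are missing from the IPVS table.
        "in_pybal_only": sorted(pybal - ipvs),
    }
```

Read this way, the alert would fire when the first list is non-empty, for example when a server stays in IPVS after PyBal has marked it down but could not depool it.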
[00:45:01] git-ssh is not working for me [00:45:09] ssh: connect to host git-ssh.wikimedia.org port 22: No route to host [00:45:11] ● git-ssh.service [00:45:11] Loaded: not-found (Reason: No such file or directory) [00:45:13] :/ [00:45:20] mutante try [00:45:24] 208.80.154.250 [00:45:25] systemctl start ssh-phab [00:45:42] Active: active (running) since Fri 2017-08-04 00:34:47 UTC; 10min ago [00:45:45] it looks like it's running [00:45:52] sshd listening to port 22 on 208.80.154.250 [00:45:55] restarts it [00:46:05] so it's pybal config issue if anything [00:46:21] but .. phab1001-vcs exited the whole time [00:46:31] existed [00:46:46] hmm, phab1001-vcs.eqiad.wmnet is linked to iridium [00:46:55] ssh_exchange_identification: Connection closed by remote host [00:46:56] fatal: Could not read from remote repository. [00:47:12] paladox: where is it linked? [00:47:34] mutante hmm not sure but we were going to use that for the migration to codfw [00:47:52] but could it be possible that when we disabled ssh on iridium did it start then? [00:48:06] yes, that's when it started [00:48:14] before we moved IPs [00:48:57] ah [00:49:00] i guess it's that [00:49:10] 10.64.32.186 ip is linked to phab1001-vcs [00:49:27] yes, go on? [00:49:47] did we move that ip over? [00:49:50] yes, we did [00:49:54] hmm [00:49:59] and the problem started before we even moved it [00:50:03] and it's now on phab1001 [00:50:05] ok [00:51:03] 404 on https://config-master.wikimedia.org/conftool/ heh [00:51:12] man IO is slow [00:51:20] https://config-master.wikimedia.org/pybal/ [00:51:36] https://config-master.wikimedia.org/pybal/eqiad/git-ssh [00:51:39] this is like before.. 
[00:51:44] didnt change it [00:51:59] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) [00:52:28] now that was the first alert that actually pages [00:52:33] phab1001-vcs.eqiad.wmnet has address 10.64.32.186 [00:52:49] that ^ should be pointed to the new IP? [00:52:59] in dns [00:53:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:53:19] !log starting phd to shut up icinga [00:53:24] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn phab migration [00:53:24] ACKNOWLEDGEMENT - PHD should be running on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn phab migration [00:53:27] ACKNOWLEDGEMENT - PHD should be supervising processes on phab1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn phab migration [00:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:59] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational [00:54:19] RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 498 (phd) [00:54:25] mutante: phab1001-vcs.eqiad.wmnet has address 10.64.32.186 [00:54:35] that IP doesn't exist on iridium or on phab1001 [00:54:38] mutante could it be ferm? [00:54:51] and why is phab2001 having an issue [00:54:57] since services were stopped on phab1001 [00:55:03] that shouldn't happen [00:55:10] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 29 processes with UID = 498 (phd) [00:55:21] twentyafterfour: inet 10.64.32.186/32 scope global eth0 [00:55:58] hmm should we change the ip to git-ssh one? [00:56:40] compair ip address on phab1001 and iridium [00:57:11] mutante: where do you see that? 
I don't see it on phab1001 with ifconfig [00:57:18] compair are a compressor manufacturer [00:57:25] lol [00:57:39] well we do need to compress the number of problems Reedy [00:57:41] twentyafterfour: ip a s [00:57:44] compair = compare [00:57:45] also this new server has the worst disk io I've seen in years [00:58:07] twentyafterfour it still has decent cpu power :P [00:58:14] twentyafterfour maybe we should request some ssd's? :) [00:58:19] decent cpu is useless if the disk is useless [00:58:28] this is true [00:58:38] maybe we "borrow" some ssds :P [00:59:10] I've seen floppy disks faster than this [00:59:12] :D [00:59:27] did iridium have ssd's? [00:59:34] we could just move them to phab1001 [00:59:51] That's probably not a good idea [00:59:55] it's writing 1013.97KB/s [00:59:56] Different warranty [01:00:00] so a bit faster than a floppy but not much [01:00:01] oh [01:00:08] paladox: no ssds [01:00:11] ok [01:00:38] we should request some :). Phab will need it being that it does tasks, files, repo's and alot other things [01:01:03] https://phabricator.wikimedia.org/T156970 [01:01:13] twentyafterfour: is it writing slowly because the source is reading slowly? Or because of the network? [01:01:19] https://phabricator.wikimedia.org/T156970#2991554 [01:01:32] wow [01:01:33] Dual 1TB SATA (sw raid) [01:01:35] more storage [01:02:02] My steam library is bigger than that [01:02:23] Reedy: this is local disk access [01:02:24] well.. if it's too slow.. and we cant fix git-ssh.. afraid we have to go back [01:02:25] chown -R [01:02:34] it finally finished the chown [01:02:45] god dang that took forever [01:02:57] it rsync'd faster than that I think [01:03:01] could it be the amount of traffic slowing the disks down? [01:03:23] Reedy lol [01:04:50] SMTP Error: Could not connect to SMTP host. 
at [/externals/phpmailer/class.phpmailer.php:804] [01:05:12] it may be phd catching up which is using all the io bandwidth [01:05:46] Was just gonna say, check with something like iotop... [01:06:52] "Pull of 'operations-puppet' failed: Command failed with error #255! COMMAND git fetch origin '+refs/*:refs/*' --prune STDOUT (empty) STDERR error: cannot open FETCH_HEAD: Permission denied" [01:06:56] https://phabricator.wikimedia.org/source/operations-puppet/manage/status/ [01:07:05] guess it needs to have the update button pressed [01:07:07] to clean the error? [01:07:40] Depends if the error stops it running for ever [01:07:46] Or if it's next scheduled run will just fix it [01:08:02] If the disk is already busy, adding extra load isn't gonna help [01:08:11] yep [01:12:04] !log phab1001 - stopped and started exim, which is now running with same options as iridium [01:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:19] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [01:14:38] try restarting ssh-phab? [01:15:09] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.78 ms [01:16:43] !log twentyafterfour@phab1001:/srv/repos$ sudo chmod -x /usr/local/sbin/sync-srv-repos [01:16:50] !log twentyafterfour@phab1001:/srv/repos$ sudo chown -R phd:www-data /srv/repos [01:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:09] yea, smtp issue :/ [01:19:54] systemctl stop ssh-phab && systemctl start ssh-phab && systemctl status ssh-phab [01:19:55] :) [01:20:18] why not just use restart? [01:20:38] or that ^^ [01:22:15] paladox: it's active and running [01:22:37] yep but somehow traffics not reaching port 22 [01:23:23] hmm 208.80.154.250 is linked to git-ssh.eqiad.wikimedia.org [01:23:32] which is not git-ssh.wikimedia.org. 
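Since the ssh-phab service reports as active while traffic does not seem to reach port 22, one way to narrow this down is to attempt a TCP connection to every address a name resolves to, optionally restricted to one address family. This is a rough sketch using only the Python standard library; the function name `probe` and its return shape are assumptions made for illustration.

```python
import socket


def probe(host, port, family=socket.AF_UNSPEC, timeout=3.0):
    """Try a TCP connect to each address of host; return (address, ok, error) tuples."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution failed, so there is nothing to connect to.
        return [(host, False, str(exc))]
    results = []
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((sockaddr[0], True, ""))
        except OSError as exc:
            # Covers refused connections, timeouts and unreachable routes.
            results.append((sockaddr[0], False, str(exc)))
    return results
```

Calling `probe("git-ssh.wikimedia.org", 22, socket.AF_INET)` and the same with `socket.AF_INET6` would show which addresses the name resolves to and whether each one accepts connections, which helps separate a DNS mismatch from a routing or firewall problem.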
[01:26:28] paladox: yes, i moved that IP too, it's still eqiad [01:26:38] ok [01:26:39] looking at the mail issue though [01:26:52] ok [01:27:11] what error do you get with the mail? [01:34:41] incoming works it seems mutante [01:35:56] paladox: i can see that the mailservers (mx) try to send mail to phab1001, like when you send mail to task@ [01:36:06] yep [01:36:20] but it fails.. while the config looks right and i can also telnet to port 25 [01:36:32] and i only see log entries on the sending side [01:36:38] hmm [01:37:41] hmm [01:38:56] wait [01:38:57] 2017-08-04 01:26:31 1ddRNh-0004Nh-PU => general R=phab T=phab_pipe S=3043 DT=5s [01:39:02] there's a mail [01:42:14] https://phabricator.wikimedia.org/T172472#3499567 [01:42:37] mutante ^^ heh [01:42:38] lol Reedy [01:43:05] duh :) [01:43:10] and yay [01:43:13] :) [01:43:31] I did say it worked... 9 minutes ago ;) [01:43:43] lol [01:43:50] 10Operations, 10Traffic, 10netops: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499584 (10ayounsi) [01:44:08] Reedy: highlight only works when at the beginning of the line, need to fix my client , heh [01:44:12] or something [01:45:13] paladox: rsync is syncing even when it's told not to.. 
bug [01:45:27] mutante auto is on by default [01:45:34] in the rsync module [01:45:44] https://github.com/wikimedia/puppet/blob/8f958d09dd48ca7993dd61381ec44ff79853ed8d/modules/rsync/manifests/quickdatacopy.pp#L28 [01:45:49] gaah, that's what i meant to check when we talked about it earlier [01:45:51] mutante lets disable it temp [01:46:00] by crossing it out [01:46:03] yes, we need it off asap [01:46:07] * paladox submits patch [01:46:11] cool [01:47:52] (03Draft1) 10Paladox: phabricator: Turn rsync off for iridium -> phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370131 [01:47:57] mutante ^^ :) [01:48:31] (03PS2) 10Paladox: phabricator: Turn rsync off for iridium -> phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370131 [01:48:39] thats slow heh [01:49:54] (03CR) 10Eevans: "> "restbase1016.eqiad.wmnet is decommissioned" just means "it's taken" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [01:51:43] (03PS3) 10Paladox: phabricator: Turn rsync off for iridium -> phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370131 [01:52:33] (03CR) 10Dzahn: [C: 032] phabricator: Turn rsync off for iridium -> phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370131 (owner: 10Paladox) [01:52:42] thanks :) [02:07:15] (03PS1) 1020after4: PHAB: hard-code IP address for smtp [puppet] - 10https://gerrit.wikimedia.org/r/370132 [02:09:45] (03CR) 10Paladox: [C: 031] PHAB: hard-code IP address for smtp [puppet] - 10https://gerrit.wikimedia.org/r/370132 (owner: 1020after4) [02:11:08] (03CR) 10Dzahn: [C: 032] PHAB: hard-code IP address for smtp [puppet] - 10https://gerrit.wikimedia.org/r/370132 (owner: 1020after4) [02:11:19] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [02:12:48] (03Draft1) 10Paladox: phabricator/dumps: Add phab1001.eqiad.wmnet [puppet] - 
10https://gerrit.wikimedia.org/r/370133 [02:12:48] (03PS2) 10Paladox: phabricator/dumps: Add phab1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/370133 [02:13:11] !log outgoing phab mail is working again [02:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:02] !log phab1001 can't talk to mx servers via IPv6, but works via IPv4. iridium and other mailservers can also talk IPv6 to it. why? it did not change even when stopping ferm on client and on server it allows from anywhere. workaround for now was to hardcode IPv4 IP in phab config. (T163938) [02:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:18] T163938: setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938 [02:28:14] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=phab1001-vcs.eqiad.wmnet [02:28:24] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=phab1001-vcs.eqiad.wmnet [02:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:09] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [02:29:15] hah :) [02:29:17] yay [02:29:20] RECOVERY - PyBal IPVS diff check on lvs1002 is OK: OK: no difference between hosts in IPVS/PyBal [02:30:59] RECOVERY - PyBal IPVS diff check on lvs1005 is OK: OK: no difference between hosts in IPVS/PyBal [02:32:11] (03PS1) 10Krinkle: webperf: Fix broken example value for eventlogging_path [puppet] - 10https://gerrit.wikimedia.org/r/370138 [02:34:57] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3499647 (10Krinkle) [02:35:00] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - 
https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) [02:35:06] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) [02:36:37] (03CR) 10Dzahn: [C: 032] phabricator/dumps: Add phab1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/370133 (owner: 10Paladox) [02:37:25] (03PS2) 10Dzahn: phabricator/dumps: remove iridium as allowed dumps host [puppet] - 10https://gerrit.wikimedia.org/r/370123 (https://phabricator.wikimedia.org/T163938) [02:40:30] (03CR) 10Dzahn: [C: 032] phabricator/dumps: remove iridium as allowed dumps host [puppet] - 10https://gerrit.wikimedia.org/r/370123 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [02:41:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [02:47:09] (03CR) 10Krinkle: [C: 031] cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 (owner: 10Dzahn) [02:48:41] twentyafterfour just wanted to say that most of the repos still show that permission error [02:48:58] ah [02:49:01] working now [02:49:01] paladox: hmm [02:49:05] just slow to catch up [02:49:39] (03PS1) 10Dzahn: phabricator: remove iridium remnants, replace with phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370140 (https://phabricator.wikimedia.org/T163938) [02:50:45] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499665 (10Dzahn) [02:51:01] twentyafterfour i wonder is there a way to click the update now button for all repos? (maybe shell script that does a recheck of them every 5 or 10 secs? 
[02:51:35] (03CR) 10Paladox: [C: 031] phabricator: remove iridium remnants, replace with phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370140 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [02:52:58] (03CR) 10Dzahn: "this means now rsync will be possible instead from phab1001 to phab2001, but not happen automatically yet" [puppet] - 10https://gerrit.wikimedia.org/r/370140 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [02:53:00] (03CR) 10Dzahn: [C: 032] phabricator: remove iridium remnants, replace with phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/370140 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [02:53:59] twentyafterfour we can try to fix git-ssh tomorrow :) [02:54:16] * paladox goes - 4am; [02:55:01] sounds good, good night paladox [02:55:08] that was also the last cleanup patch for now i think [02:55:30] and i like that LVS is happy again at least [02:55:39] and mail of course [02:55:56] :) [02:56:19] twentyafterfour: the rsync will now be allowed from 1001 to 2001 [02:56:20] yeah I think we can call it a night, we need help from netops to figure out the routing on git-ssh and ipv6 [02:56:23] but not auto [02:56:25] unless we enable it [02:56:26] ok [02:56:37] yes [02:56:41] ok :) [02:56:47] it wasn't THAT bad, heh [02:57:10] mutante: do we need to write an incident report even though it was a planned maintenance? (since it didn't quite go according to plan) [02:57:52] twentyafterfour: phab itself, http and mail, using tickets.. that was all up within the planned time, right [02:57:54] the service was actually back online pretty quickly [02:57:56] or did we take longer [02:58:16] only the git-ssh part but like I said, it's barely used by anyone [02:58:41] hmm.. let's make a ticket that it's broken [02:58:46] ok [02:58:51] i think that is enough [02:58:57] want me to do that part? 
you've done enough [02:59:05] sure:) [02:59:09] thanks [03:04:38] https://phabricator.wikimedia.org/T172478 [03:04:53] 10Operations, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell) [03:04:57] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499702 (10mmodell) [03:05:11] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell) [03:06:00] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499689 (10mmodell) p:05Triage>03High [03:07:39] (03CR) 10Krinkle: "What is the default for $wgRCFeeds from wmf-config without this override? Is that suitable for labs? Looks like it might wrongly inherit a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [03:08:21] paladox: go to bed ;) [03:08:38] good night everyone thanks for all of your support [03:08:46] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499709 (10mmodell) Note that this is high priority but not UBN, simply because git-ssh is barely used currently. Phabricator supports git over https which is... 
[03:09:33] good night twentyafterfour [03:21:30] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499712 (10ayounsi) a:03ayounsi [03:27:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 754.62 seconds [03:28:57] 10Operations, 10MediaWiki-JobRunner, 10Release-Engineering-Team, 10monitoring, 10Wikimedia-Incident: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3499719 (10Krinkle) [03:32:18] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499734 (10bd808) Diffusion is the master repo for some sub-set of #toolforge projects. Its not a huge number of people impacted, but it is certainly non-zero.... [03:41:43] 10Operations, 10MediaWiki-JobRunner, 10Release-Engineering-Team, 10monitoring, 10Wikimedia-Incident: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3499754 (10Krinkle) [04:03:20] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 31 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [04:07:30] PROBLEM - PHD should be running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 997 (phd) [04:08:29] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [04:09:00] arg, downtime expired. 
but wasn't icinga's fault [04:09:07] can be ignored on iridium [04:09:56] ACKNOWLEDGEMENT - PHD should be running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 997 (phd) daniel_zahn migration [04:10:51] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) daniel_zahn migration [04:13:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.15 seconds [04:23:08] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499771 (10mmodell) ok so that takes care of the smtp/ipv6 issue, however, git-ssh still doesn't work. So I guess I was wrong about them being related. [04:24:11] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499773 (10mmodell) [04:24:16] 10Operations, 10Phabricator, 10netops: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499772 (10mmodell) 05Resolved>03Open [04:25:01] (03PS1) 1020after4: Revert "PHAB: hard-code IP address for smtp" [puppet] - 10https://gerrit.wikimedia.org/r/370144 [04:25:35] (03PS2) 1020after4: Revert "PHAB: hard-code IP address for smtp" [puppet] - 10https://gerrit.wikimedia.org/r/370144 (https://phabricator.wikimedia.org/T172478) [04:27:32] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499790 (10mmodell) We can revert the hard-coded smtp server IPs now: cd461e5cf761f053d453528cd26331c80ba66f17 [04:29:31] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499792 
(10Dzahn) That IP that was removed also existed on iridium before, on eth0, were i removed it: 00:23 mutante: iridium sudo ip a... [04:36:14] (03PS3) 10Dzahn: Revert "PHAB: hard-code IP address for smtp" [puppet] - 10https://gerrit.wikimedia.org/r/370144 (https://phabricator.wikimedia.org/T172478) (owner: 1020after4) [04:38:29] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [04:49:20] 10Operations, 10Traffic, 10netops: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499804 (10Marostegui) Hi, We have some critical DB hosts on that row that would need to be either failed over or to communicate to users that a period of read-only is happening. To fail over those hos... [04:56:27] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499806 (10ayounsi) the git-ssh issue is due to LVS not knowing where to forward the packets. ``` ayounsi@lvs1002:~$ sudo ipvsadm -Ln TC... [04:57:38] (03PS1) 10Dzahn: phab: fix IP for phab1001-vcs [dns] - 10https://gerrit.wikimedia.org/r/370145 (https://phabricator.wikimedia.org/T172478) [04:59:13] (03CR) 10Dzahn: [C: 032] phab: fix IP for phab1001-vcs [dns] - 10https://gerrit.wikimedia.org/r/370145 (https://phabricator.wikimedia.org/T172478) (owner: 10Dzahn) [04:59:49] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:02:02] (03PS1) 10Dzahn: phabricator: fix IP for git-ssh.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/370146 (https://phabricator.wikimedia.org/T172478) [05:02:43] (03CR) 10Dzahn: [C: 032] phabricator: fix IP for git-ssh.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/370146 (https://phabricator.wikimedia.org/T172478) (owner: 10Dzahn) [05:07:01] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499811 (10Dzahn) picked a new IP in the 10.64.16.0/22 network (row B) and used that instead [05:09:55] (03PS4) 10Dzahn: Revert "PHAB: hard-code IP address for smtp" [puppet] - 10https://gerrit.wikimedia.org/r/370144 (https://phabricator.wikimedia.org/T172478) (owner: 1020after4) [05:14:04] (03CR) 10Dzahn: [C: 032] Revert "PHAB: hard-code IP address for smtp" [puppet] - 10https://gerrit.wikimedia.org/r/370144 (https://phabricator.wikimedia.org/T172478) (owner: 1020after4) [05:23:21] !log phab1001 sudo ip addr del 10.64.32.186/32 dev eth0 (T172478) [05:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:36] T172478: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478 [05:27:59] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:04:50] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [06:06:19] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [06:08:50] woo! works, who fixed it? 
:D [06:10:47] XioNoX did [06:10:57] by restarting pybal [06:12:00] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499869 (10Dzahn) Xionox restarted pybal after this .. and then: 23:04 <+icinga-wm> RECOVERY - PyBal backends health check on lvs1005 is... [06:21:20] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499870 (10Dzahn) We can now talk to the ssh. Tested from external, IPv4 and IPv6. There is apparently another issue with Phabricator its... [06:42:29] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 50 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:47:14] !log Sanitize hiwikiversity on sanitarium and sanitarium2 - T171829 [06:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:29] T171829: Prepare and check storage layer for hi.wikiversity - https://phabricator.wikimedia.org/T171829 [07:07:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [07:07:47] !log installing imagemagick regression security updates on trusty [07:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:06] (03PS1) 1020after4: PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 [07:14:40] (03CR) 10jerkins-bot: [V: 04-1] PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (owner: 1020after4) [07:14:54] (03PS2) 10Dzahn: PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (owner: 1020after4) [07:15:30] (03CR) 10jerkins-bot: [V: 04-1] 
PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (owner: 1020after4) [07:17:15] (03PS3) 10Dzahn: PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (owner: 1020after4) [07:17:55] (03CR) 10jerkins-bot: [V: 04-1] PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (owner: 1020after4) [07:19:08] (03PS4) 10Dzahn: PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (https://phabricator.wikimedia.org/T172478) (owner: 1020after4) [07:20:08] (03CR) 10Dzahn: [C: 032] PHAB: move the ssh hook somewhere sshd won't complain about [puppet] - 10https://gerrit.wikimedia.org/r/370153 (https://phabricator.wikimedia.org/T172478) (owner: 1020after4) [07:20:38] <_joe_> tests running in 10 seconds :) [07:20:50] <_joe_> mutante: did you feel jenkins responding faster today? [07:21:05] yes, that felt relatively fast now [07:21:14] it does seem fairly quick, would be awesome if mediawiki tests ran quickly [07:21:42] twentyafterfour: try again [07:22:19] mutante: works!!! [07:22:22] YAY! [07:22:34] and this should conclude the phab migration we started ... earlier :) [07:22:59] hahaha finally [07:23:12] thanks mutante for all the help, I owe you one, and XioNoX too! [07:23:15] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499923 (10Dzahn) [07:23:21] 10Operations, 10Phabricator, 10netops, 10Patch-For-Review: git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001 - https://phabricator.wikimedia.org/T172478#3499921 (10Dzahn) 05Open>03Resolved 00:21 < mutante> twentyafterfour: try again 00:22 < twentyafterfour> mutante: works!!! 
[07:23:31] and finally we can check off one more ubuntu host from the long list [07:23:33] twentyafterfour: high-five [07:23:37] it's nearly the last one isn't it? [07:23:43] yes, i like removing -buntu [07:24:25] "nearly the last" i think not yet.. let's see [07:25:33] we should edit the plan on https://phabricator.wikimedia.org/T163938 :) [07:26:11] wants to close that and just decom ticket for iridium [07:26:23] but tomorrow is still time to cleanup [07:27:58] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3499927 (10mmodell) 05Open>03Resolved [07:28:15] :) good work mutante, thanks again, sorry this kept you up so late [07:28:31] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499928 (10Dzahn) [07:30:16] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273664 (10Dzahn) T152129 has been resolved [07:31:17] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3499936 (10mmodell) [07:31:37] twentyafterfour: :) glad we got everything done now. better late than continue another day or get more people involved. thanks as well. 
and good night for now [07:31:48] good night :) [07:33:13] also, phab1001 works with scap, with one caveat that we need to run puppet after a scap deploy due to a bunch of silly stuff in puppet [07:33:36] ubuntu-- [07:33:38] scap++ [07:39:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370155 (https://phabricator.wikimedia.org/T171321) [07:43:59] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499956 (10Dzahn) [07:44:21] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370155 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [07:45:32] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3499968 (10Dzahn) [07:45:49] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370155 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [07:45:53] 10Operations, 10ops-eqiad, 10Phabricator, 10Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3499972 (10Dzahn) [07:45:58] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273670 (10Dzahn) 05Open>03Resolved [07:46:20] (03CR) 10jenkins-bot: db-codfw.php: Depool db2073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370155 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [07:47:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2073 - T171321 (duration: 00m 47s) [07:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:20] T171321: Finish dbstore2002 migration to multi-instance - 
https://phabricator.wikimedia.org/T171321 [07:47:22] !log Stop MySQL on db2073 to copy its data to dbstore2002 - https://phabricator.wikimedia.org/T171321 [07:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499979 (10Dzahn) 05Open>03stalled [07:52:09] (03PS1) 10Marostegui: mariadb: Add instance s4 to dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/370156 (https://phabricator.wikimedia.org/T171321) [07:54:47] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/7294/" [puppet] - 10https://gerrit.wikimedia.org/r/370156 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [07:54:49] (03CR) 10Marostegui: [C: 032] mariadb: Add instance s4 to dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/370156 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [08:04:25] 10Operations, 10Traffic, 10netops: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3500004 (10elukey) A couple of notes from my side after reading the host list: Analytics: 1) all the analytics* host in row D down shouldn't be an issue for a brief amount of time since the Hadoop clus... 
[08:04:42] !log Sanitize wikimania2018wiki on sanitarium and sanitarium2 - T155041 [08:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:55] T155041: Prepare and check storage layer for wikimania2018wiki - https://phabricator.wikimedia.org/T155041 [08:13:54] (03PS3) 10Giuseppe Lavagetto: Rakefile: re-add some global tasks [puppet] - 10https://gerrit.wikimedia.org/r/369918 [08:19:07] !log Deploy schema change directly on s3 master for wikimania2018wiki - T172485 [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:18] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [08:25:33] (03CR) 10Marostegui: [C: 04-1] mariadb/phabricator: update GRANTS from iridium to phab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [08:33:03] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3500069 (10jcrespo) [08:35:04] !log Deploy schema change directly on s3 master for hiwikiversity - T172485 [08:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:16] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [08:35:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: re-add some global tasks [puppet] - 10https://gerrit.wikimedia.org/r/369918 (owner: 10Giuseppe Lavagetto) [08:37:04] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3500094 (10jcrespo) [08:37:15] (03Abandoned) 10Giuseppe Lavagetto: admin: fixup for I191cbe091347e [puppet] - 10https://gerrit.wikimedia.org/r/368610 (owner: 10Giuseppe 
Lavagetto) [08:43:54] 10Operations, 10DBA, 10Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#3500144 (10jcrespo) [08:44:29] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10Joe) I did rewrite the Rakefile according to @faidon's suggestions, did tweak the... [08:45:00] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3500148 (10Joe) [08:45:05] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Joe: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3500147 (10Joe) 05Open>03Resolved [08:47:49] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3500151 (10jcrespo) I made a comment somewhere (cannot find where) saying this may aff... [08:49:45] (03PS1) 10Giuseppe Lavagetto: Use a textarea for content differences [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370160 (https://phabricator.wikimedia.org/T172362) [09:07:12] (03PS1) 10Giuseppe Lavagetto: Revert "base: missing quotes around is_virtual 'false' for ipmi" [puppet] - 10https://gerrit.wikimedia.org/r/370165 [09:07:49] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3500242 (10Marostegui) >>! In T115982#3500151, @jcrespo wrote: > I made a comment some... 
[09:07:51] (03CR) 10jerkins-bot: [V: 04-1] Revert "base: missing quotes around is_virtual 'false' for ipmi" [puppet] - 10https://gerrit.wikimedia.org/r/370165 (owner: 10Giuseppe Lavagetto) [09:14:20] (03PS1) 10Giuseppe Lavagetto: parsoid: test commmit for T149432 [puppet] - 10https://gerrit.wikimedia.org/r/370168 [09:14:31] (03Abandoned) 10Giuseppe Lavagetto: Revert "base: missing quotes around is_virtual 'false' for ipmi" [puppet] - 10https://gerrit.wikimedia.org/r/370165 (owner: 10Giuseppe Lavagetto) [09:18:54] (03PS1) 10Marostegui: s4.hosts: Add dbstore2002 [software] - 10https://gerrit.wikimedia.org/r/370169 (https://phabricator.wikimedia.org/T171321) [09:19:39] !log Deploy schema change directly on s3 master for techconductwiki - T172485 [09:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:51] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [09:19:52] (03PS1) 10Elukey: modules::geowiki::private_data: fix rsync for data-private-bare [puppet] - 10https://gerrit.wikimedia.org/r/370170 (https://phabricator.wikimedia.org/T132324) [09:21:30] (03CR) 10Marostegui: [C: 032] s4.hosts: Add dbstore2002 [software] - 10https://gerrit.wikimedia.org/r/370169 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [09:22:21] (03Merged) 10jenkins-bot: s4.hosts: Add dbstore2002 [software] - 10https://gerrit.wikimedia.org/r/370169 (https://phabricator.wikimedia.org/T171321) (owner: 10Marostegui) [09:22:27] !log Add dbstore2002 to tendril - T171321 [09:22:35] 10Operations, 10puppet-compiler, 10Patch-For-Review: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#3500260 (10Joe) This should be resolved with the new home-brewed differ: I did a test change: https://gerrit.wikimedia.org/r/#/c/370168/... 
[09:22:38] (03CR) 10Elukey: [C: 032] modules::geowiki::private_data: fix rsync for data-private-bare [puppet] - 10https://gerrit.wikimedia.org/r/370170 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [09:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:39] T171321: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321 [09:22:42] 10Operations, 10puppet-compiler, 10Patch-For-Review: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#3500261 (10Joe) 05Open>03Resolved [09:29:24] (03CR) 10Filippo Giunchedi: [C: 031] Use a textarea for content differences [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370160 (https://phabricator.wikimedia.org/T172362) (owner: 10Giuseppe Lavagetto) [09:32:17] 10Operations, 10Puppet, 10DBA: Switch databases to the future parser - https://phabricator.wikimedia.org/T172498#3500273 (10jcrespo) [09:39:29] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3500339 (10elukey) Any news on this? :) [09:50:20] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500371 (10elukey) >>! In T121561#3323871, @Ottomata wrote: > We should do some work to understand how ACLs work and what ACLs f... [09:53:05] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500389 (10elukey) > Note that this plan doesn't yet consider encryption of traffic between Kafka and Zookeeper. Should we? We'... 
[10:03:01] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500417 (10elukey) @Ottomata should we keep this task open given that we already have https://phabricator.wikimedia.org/T166167 ? [10:05:17] !log Stop replication on db2073 for maintenance [10:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:25] (03CR) 10DCausse: [C: 031] "test on relforge and seems to work as expected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [10:14:12] !log Deploy schema change directly on s3 master for atjwiki - T172485 [10:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:23] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [10:35:09] (03CR) 10Elukey: "Updating this code review:" [puppet] - 10https://gerrit.wikimedia.org/r/362237 (https://phabricator.wikimedia.org/T169248) (owner: 10Nuria) [10:44:56] (03PS5) 10Muehlenhoff: Adapt debdeploy server components to Cumin (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 [11:14:53] !log Deploy schema change directly on s3 master for dinwiki - T172485 [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:07] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [11:23:19] (03Draft1) 10Paladox: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/370179 [11:23:23] (03PS2) 10Paladox: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/370179 [11:30:39] !log Deploy schema change directly on s3 master for kbpwiki - T172485 [11:30:50] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:50] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [11:40:41] (03PS1) 10Urbanecm: Fix srwiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370183 (https://phabricator.wikimedia.org/T150618) [11:46:27] !log Deploy schema change directly on s3 master for maiwikimedia - T172485 [11:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:38] T172485: Convert unique keys into primary keys for some wiki tables on s3 (both eqiad and codfw) - https://phabricator.wikimedia.org/T172485 [12:18:00] (03PS6) 10Muehlenhoff: Adapt debdeploy server components to Cumin [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 [12:23:39] (03CR) 10Muehlenhoff: [C: 032] Adapt debdeploy server components to Cumin [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/368190 (owner: 10Muehlenhoff) [12:38:25] (03PS1) 10Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) [12:39:04] (03CR) 10jerkins-bot: [V: 04-1] role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [12:42:43] (03PS2) 10Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) [12:53:36] (03PS3) 10Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) [12:56:36] (03PS1) 10Muehlenhoff: Remove some obsolete salt references in the docs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/370191 [12:59:08] (03CR) 10Muehlenhoff: [C: 032] Remove some obsolete 
salt references in the docs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/370191 (owner: 10Muehlenhoff) [13:07:53] (03PS1) 10Giuseppe Lavagetto: puppetdb: spin off module, support hsqldb [puppet] - 10https://gerrit.wikimedia.org/r/370192 (https://phabricator.wikimedia.org/T150456) [13:08:07] <_joe_> elukey: an_cluster? [13:08:08] <_joe_> come on [13:08:10] <_joe_> :P [13:08:41] It's friday afternoon! [13:10:22] (03PS4) 10Gehel: wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 [13:10:59] (03CR) 10Gehel: wdqs - remove upstart configuration files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/369688 (owner: 10Gehel) [13:11:57] (03PS4) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 [13:12:26] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:30] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7299" [puppet] - 10https://gerrit.wikimedia.org/r/370192 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [13:13:17] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 79151 bytes in 0.140 second response time [13:14:26] <_joe_> arg, damn puppet [13:16:26] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:20:40] _joe_ I can't write more than X chars, how can I fit everything ? :D [13:20:51] <_joe_> elukey: ?? [13:20:56] an_cluster [13:21:10] <_joe_> why you can't write more than X chars? [13:21:19] Jenkins complains if the first line has more than 50 chars IIRC [13:21:21] <_joe_> also, what's wrong with 'analytics' ? [13:21:35] that I have more space to write the headline! 
:) [13:21:51] <_joe_> or role::...hadoop::client :P [13:21:57] <_joe_> that seems like a real name [13:25:03] (03PS1) 10Giuseppe Lavagetto: puppetdb: move more things to the module [puppet] - 10https://gerrit.wikimedia.org/r/370199 (https://phabricator.wikimedia.org/T150456) [13:26:22] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3500872 (10Ottomata) Let's keep it open and use this task to track actually enabling TLS / ACLs for different clients. [13:31:11] (03CR) 10Ottomata: "I commented in ticket, but https://phabricator.wikimedia.org/T118772#3084302 should probably be done. If so, then your path would probabl" [puppet] - 10https://gerrit.wikimedia.org/r/370138 (owner: 10Krinkle) [13:32:38] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetdb: move more things to the module [puppet] - 10https://gerrit.wikimedia.org/r/370199 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [13:33:57] (03CR) 10Ottomata: "Thanks elukey! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/370170 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [13:34:22] (03PS3) 10BBlack: Reserve non zero rated IPs and ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [13:35:27] (03CR) 10BBlack: "Added git-ssh, and s/misc-web2-lb/misc-web/" [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [13:35:36] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:37:56] (03CR) 10BBlack: [C: 031] cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 (owner: 10Dzahn) [13:37:59] (03CR) 10Luke081515: [C: 031] Allow bureaucrats on WMF wikis to grant and remove 'confirmed' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983) (owner: 10MarcoAurelio) [13:38:07] (03CR) 10BBlack: [C: 031] Reserve non zero rated IPs and ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [13:38:18] (03CR) 10Elukey: "Andrew let's discuss this :)" [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [13:45:46] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 496.67 seconds [13:46:43] marostegui: I am wondering if we could leave only the log database in --^ [13:47:07] elukey: that is for the analysts to decide :) [13:47:35] marostegui: almost nobody uses that host, dbstore1002 is the one used by most people.. I'll ask :) [13:47:48] elukey: thanks! :) [13:47:59] elukey: if dbstore1002 goes down, they'd use it, no? 
[13:48:56] marostegui: not sure, but yesterday I saw that hal*fak created a task to move away from the dbstore1002 model so I guess that we need to do it anyway sooner or later :) [13:51:43] (03PS1) 10BBlack: Add new misc-web-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/370201 (https://phabricator.wikimedia.org/T170518) [13:51:45] (03PS1) 10BBlack: Add new git-ssh IPs [puppet] - 10https://gerrit.wikimedia.org/r/370202 (https://phabricator.wikimedia.org/T170518) [13:54:20] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/359451 (owner: 10Faidon Liambotis) [13:54:41] (03CR) 10jerkins-bot: [V: 04-1] graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 (owner: 10Faidon Liambotis) [13:55:46] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.45 seconds [13:57:35] elukey: going to silence db1047 [13:58:01] it recovered [13:58:46] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [14:04:59] (03CR) 10BBlack: [C: 04-1] "Hmm wait, we have v6 issues here...." 
[dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [14:07:13] (03PS1) 10Muehlenhoff: Separate generate-debdeploy-spec for Cumin-based debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/370203 [14:13:48] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: add puppetdb installation [puppet] - 10https://gerrit.wikimedia.org/r/370205 [14:19:23] (03CR) 10Muehlenhoff: [C: 032] Separate generate-debdeploy-spec for Cumin-based debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/370203 (owner: 10Muehlenhoff) [14:20:00] (03CR) 10Herron: [C: 032] Change wikipedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/370040 (https://phabricator.wikimedia.org/T170891) (owner: 10Herron) [14:20:08] (03PS2) 10Herron: Change wikipedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/370040 (https://phabricator.wikimedia.org/T170891) [14:27:27] (03PS4) 10BBlack: Reserve non zero rated IPs and ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [14:29:09] (03CR) 10BBlack: "V6 fixup: document 2620:0:86[0123]:ed14::/64 as non-zero-rated (to match the ed1a /64), and the ::0:0/112 within as our LVS range for non-" [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [14:33:19] (03PS2) 10BBlack: Add new misc-web-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/370201 (https://phabricator.wikimedia.org/T170518) [14:33:21] (03PS2) 10BBlack: Add new git-ssh IPs [puppet] - 10https://gerrit.wikimedia.org/r/370202 (https://phabricator.wikimedia.org/T170518) [14:36:21] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: add puppetdb installation [puppet] - 10https://gerrit.wikimedia.org/r/370205 (owner: 10Giuseppe Lavagetto) [14:38:29] (03PS3) 10BBlack: Add new misc-web-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/370201 
(https://phabricator.wikimedia.org/T170518) [14:38:31] (03PS3) 10BBlack: Add new git-ssh IPs [puppet] - 10https://gerrit.wikimedia.org/r/370202 (https://phabricator.wikimedia.org/T170518) [14:38:33] (03PS1) 10BBlack: Add LVS nonzero ranges in network::subnets [puppet] - 10https://gerrit.wikimedia.org/r/370210 (https://phabricator.wikimedia.org/T170518) [14:41:59] 10Operations, 10Mail, 10Security: Make SPF for wikipedia.org more strict - https://phabricator.wikimedia.org/T170891#3501212 (10Reedy) [14:46:58] (03CR) 10Elukey: "ppc: https://puppet-compiler.wmflabs.org/compiler02/7301/analytics1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:51:24] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501239 (10Andrew) [14:56:55] Reedy: can you deploy it before weekend? [14:57:03] Deploy what? [14:57:41] (03PS1) 10Rush: labsdb: maintain-views and maintain_meta-p sock option [puppet] - 10https://gerrit.wikimedia.org/r/370217 (https://phabricator.wikimedia.org/T172496) [14:58:06] Reedy: https://gerrit.wikimedia.org/r/#/c/368770/ [14:59:30] Steinsplitter: I could, but probably not much point [14:59:52] Reedy: not much point? [14:59:56] We're not using it [15:00:01] Because of unintended side effect [15:01:09] why it was then on the (german) tech news... okay, thanks for the information [15:04:12] Steinsplitter: https://meta.wikimedia.org/wiki/MediaWiki:Email-blacklist [15:04:13] It's empty [15:04:22] So it does nothing [15:04:29] i added nothing yet. [15:04:55] Also, we're not really supposed to deploy on fridays ;) [15:19:08] what's going on? 
:) [15:21:46] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3501433 (10jcrespo) 05Open>03Resolved I have created all actionables on both the incident documentation ( https://wikitech.wikimedia.org/... [15:22:39] (03PS3) 10Thcipriani: Allow mwdeploy user to restart jobchron [puppet] - 10https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) [15:33:54] I am fixing dbstore2001 due to yesterday's x1 creations [15:34:07] at some point I will submit a patch to mediawiki- but then I get lazy [15:42:16] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3501452 (10Andrew) [15:44:57] cd [15:48:32] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3501476 (10RobH) a:05RobH>03Cmjohnson This does indeed need hardware raid setup, please setup in a large raid10 of all disks via crash cart, then I can install. Al... [15:56:08] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3501481 (10Jgreen) >>! In T171962#3492728, @mark wrote: > No objections from me. It does add complexity somewhat and will probably add some failure modes wher... 
[15:59:58] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3501493 (10RobH) [16:00:00] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3501488 (10RobH) 05Open>03stalled p:05Normal>03Low [16:05:46] (03PS1) 10RobH: cp400[1-4] decom, mgmt dns removal [dns] - 10https://gerrit.wikimedia.org/r/370224 (https://phabricator.wikimedia.org/T169020) [16:06:39] (03CR) 10RobH: [C: 032] cp400[1-4] decom, mgmt dns removal [dns] - 10https://gerrit.wikimedia.org/r/370224 (https://phabricator.wikimedia.org/T169020) (owner: 10RobH) [16:07:54] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3501505 (10RobH) [16:08:05] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3384648 (10RobH) [16:25:17] 10Operations, 10Cloud-Services, 10procurement: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:25:21] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:26:50] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (10RobH) [16:51:56] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:52:21] (03CR) 10Smalyshev: [C: 031] wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 (owner: 10Gehel) [16:56:45] gwicke: docker-testing01.services.eqiad.wmflabs is a disaster, gobbling diskspace and puppet broken. Can I just delete it? 
godog [16:56:56] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 9 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:00:42] ottomata: would it be possible for you to fix puppet on druid101.analytics.eqiad.wmflabs ? [17:00:46] I can do it, but… I have lot of these [17:02:10] (03CR) 10Mobrovac: [C: 031] JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [17:17:05] andrewbogott: deleted it. [17:17:09] thanks [17:18:24] 10Puppet, 10Cloud-VPS: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501730 (10bd808) [17:22:35] godog: is the 'swift' project still your thing? Both swift-stretch-ms-be02.swift.eqiad.wmflabs and swift-stretch-ms-be01.swift.eqiad.wmflabs have broken puppet [17:23:09] thcipriani: puppet is broken on thcipriani-mediawiki.staging.eqiad.wmflabs, thcipriani-proxy.staging.eqiad.wmflabs, thcipriani-tin.staging.eqiad.wmflabs [17:23:20] do you have time to take a look? I can fix them but may not use a subtle approach :) [17:23:34] andrewbogott: heh, I can take a look, thanks for the heads up :) [17:23:46] thank you! [17:25:56] 10Puppet, 10Cloud-VPS: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3501753 (10bd808) `::profile::puppetmaster::common` seems to be applied somehow via [[https://tools.wmflabs.org/openstack-browser/puppetclass/role::puppet_compiler... [17:35:38] hm, gwicke out today? [17:38:37] andrewbogott: if so I would suspend it per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/instance_lifecycle [17:39:15] * andrewbogott nods [17:57:25] (03CR) 10Hashar: [C: 04-1] "Ah virtual packages. Yeah that is the good point and I dont remember what was the issue with it." 
[puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [18:17:32] !log switched most cloud instance to new puppetmasters, as per https://phabricator.wikimedia.org/T171786 [18:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:34] (03Abandoned) 10Smalyshev: [WIP] Enable banning clients by IP by setting wdqs::gui::bad_clients [puppet] - 10https://gerrit.wikimedia.org/r/365821 (https://phabricator.wikimedia.org/T170860) (owner: 10Smalyshev) [19:06:08] 10Puppet, 10Cloud-VPS: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502028 (10bd808) Caused by a refactoring in progress for `::role::puppet_compiler` by @Joe: https://gerrit.wikimedia.org/r/#/c/370205/1 [19:24:37] (03PS4) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [19:26:16] 10Puppet, 10Cloud-VPS: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3502061 (10bd808) a:03Joe [19:27:06] 10Operations, 10Performance-Team, 10monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3502062 (10Krinkle) [19:33:35] (03PS3) 10Dzahn: cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 [19:34:25] (03CR) 10Dzahn: [C: 032] cache::misc/graphite: rename director, don't send cross-dc traffic [puppet] - 10https://gerrit.wikimedia.org/r/370107 (owner: 10Dzahn) [19:34:32] jenkins fast , yes ! 
[19:36:07] mutante: :) [19:38:26] !log renaming graphite varnish director/fixing config, running puppet on cache misc, tested on cp1045 [19:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:30] (03CR) 10Dzahn: "yea, we should enable it, but we need to fix the permissions issue. it would be fixed if the phd user had the same UID on both servers. we" [puppet] - 10https://gerrit.wikimedia.org/r/370179 (owner: 10Paladox) [19:43:11] (03CR) 10Dzahn: "puppet ran on all misc::cache via cumin. performance.wikimedia.org and graphite.wikimedia.org work like before" [puppet] - 10https://gerrit.wikimedia.org/r/370107 (owner: 10Dzahn) [19:46:40] (03CR) 10Dzahn: mariadb/phabricator: update GRANTS from iridium to phab1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [19:47:00] (03PS4) 10Dzahn: mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) [19:49:55] (03PS5) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [19:54:23] (03PS4) 10Andrew Bogott: get $::labsproject from the certname [puppet] - 10https://gerrit.wikimedia.org/r/368606 (https://phabricator.wikimedia.org/T171289) [19:54:25] (03PS1) 10Andrew Bogott: Switch labs to use labs-puppetmaster.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/370246 [19:59:13] (03PS1) 10Andrew Bogott: Add profile::openstack::main::observer_password [labs/private] - 10https://gerrit.wikimedia.org/r/370247 [20:00:45] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502109 (10Dzahn) a:03Dzahn [20:01:09] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add profile::openstack::main::observer_password [labs/private] - 
10https://gerrit.wikimedia.org/r/370247 (owner: 10Andrew Bogott) [20:02:01] !log wikitech-static-ord - apt-get install certbot [20:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:27] (03PS6) 10Eevans: WIP: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [20:06:24] 10Operations, 10Performance-Team, 10monitoring: Ensure getLagTimes.php is working properly - https://phabricator.wikimedia.org/T172559#3502131 (10Krinkle) 05Open>03Resolved p:05Triage>03Normal a:03Krinkle The script is part of MediaWiki core: * [/maintenance/getLagTimes.php](https://github.com/wiki... [20:08:24] (03CR) 10Andrew Bogott: [C: 032] Switch labs to use labs-puppetmaster.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/370246 (owner: 10Andrew Bogott) [20:11:47] (03CR) 10Andrew Bogott: [C: 032] get $::labsproject from the certname [puppet] - 10https://gerrit.wikimedia.org/r/368606 (https://phabricator.wikimedia.org/T171289) (owner: 10Andrew Bogott) [20:13:16] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3502143 (10Andrew) [20:14:12] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502145 (10Dzahn) I renewed the certificate with this command: `/usr/local/sbin/acme-setup -i wikitech-static -s wikitech-static.wikimedia.org -m acme -w apache2` (On a puppetized h... 
[20:17:06] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502148 (10Dzahn) added to root's crontab: `@monthly /usr/local/sbin/acme-setup -i wikitech-static -s wikitech-static.wikimedia.org -m acme -w apache2 ` [20:17:15] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502149 (10Dzahn) 05Open>03Resolved [20:18:26] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:19:55] !log renewing SSL cert for status.wm.org (just like wikitech-static, but that one didnt have monitoring?) [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:53] 10Operations: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502170 (10Dzahn) There is also https://status.wikimedia.org on this. It had the same issue, the cert was about to expire, but for that one we didn't have monitoring? I did the same... [20:27:11] hmm uploading a file to phab is returning "No configured storage engine can store this file. See "Configuring File Storage" in the documentation for information on configuring storage engines." 
[20:27:14] twentyafterfour ^^ [20:29:45] (03PS1) 10Dzahn: icinga/certs: add monitoring for status.wm.org cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) [20:32:04] oh it's because the image is bigger than phab's mysql limit of 3mb [20:36:25] (03CR) 10Dzahn: [C: 032] icinga/certs: add monitoring for status.wm.org cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) (owner: 10Dzahn) [20:36:31] (03PS1) 10Andrew Bogott: labs-puppetmaster: use new labs-puppetmaster host for enc [puppet] - 10https://gerrit.wikimedia.org/r/370249 [20:36:32] (03PS1) 10Andrew Bogott: clean up a few more labs-puppetmaster-eqiad refs [puppet] - 10https://gerrit.wikimedia.org/r/370250 (https://phabricator.wikimedia.org/T171786) [20:36:34] (03PS1) 10Andrew Bogott: toolschecker: use the new puppetmaster for manifest checks [puppet] - 10https://gerrit.wikimedia.org/r/370251 (https://phabricator.wikimedia.org/T171786) [20:36:36] (03PS1) 10Andrew Bogott: shinken: test the new labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/370252 (https://phabricator.wikimedia.org/T171786) [20:37:39] (03PS2) 10Dzahn: icinga/certs: add monitoring for status.wm.org cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) [20:39:22] (03PS3) 10Dzahn: icinga/certs: add monitoring for status.wm.org cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) [20:40:51] (03CR) 10Paladox: icinga/certs: add monitoring for status.wm.org cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) (owner: 10Dzahn) [20:43:44] (03CR) 10Andrew Bogott: [C: 032] labs-puppetmaster: use new labs-puppetmaster host for enc [puppet] - 10https://gerrit.wikimedia.org/r/370249 (owner: 10Andrew Bogott) [20:43:51] (03PS2) 10Andrew Bogott: labs-puppetmaster: use 
new labs-puppetmaster host for enc [puppet] - 10https://gerrit.wikimedia.org/r/370249 [20:54:08] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. [20:55:47] 10Operations, 10Patch-For-Review: wikitech-static.wikimedia.org certificate renewal (expiring 2017-08-09) - https://phabricator.wikimedia.org/T172285#3502304 (10Dzahn) monitoring added: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wikitech-static.wikimedia.org&service=HTTPS-status-wikim... [21:02:41] (03CR) 10Eevans: "Based on the [Puppet Compiler output](http://puppet-compiler.wmflabs.org/7311), it would appear there are problems with this still." [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [21:07:37] (03PS2) 10Andrew Bogott: clean up a few more labs-puppetmaster-eqiad refs [puppet] - 10https://gerrit.wikimedia.org/r/370250 (https://phabricator.wikimedia.org/T171786) [21:08:21] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 4 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3502355 (10Dzahn) [21:09:34] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 4 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3469036 (10Dzahn) [21:09:58] (03CR) 10Andrew Bogott: [C: 032] clean up a few more labs-puppetmaster-eqiad refs [puppet] - 10https://gerrit.wikimedia.org/r/370250 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [21:11:47] (03CR) 10Dzahn: icinga/certs: add monitoring for status.wm.org cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) (owner: 10Dzahn) [21:12:36] (03CR) 10Dzahn: icinga/certs: add monitoring for status.wm.org cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370248 (https://phabricator.wikimedia.org/T172285) (owner: 10Dzahn) 
[21:14:37] (03CR) 10Dzahn: [C: 04-1] "phab1001 is ready now, and phab2001 will follow and is unblocked, but needs checking, not yet" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [21:16:16] (03CR) 10Dzahn: "with or without puppet role, reverting would need a bunch of changes now. if we wait here then puppet has to stay disabled, while with rol" [puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [21:22:26] (03PS2) 10Dzahn: site/phabricator: remove phab role from iridium, make it a spare [puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) [21:24:48] !log T172384: Disabling Puppet in dev environment to prevent unattended Cassandra restarts [21:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:02] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384 [21:25:38] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [21:25:38] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [21:25:38] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [21:25:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: 
Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [21:25:43] (03PS3) 10Dzahn: site/phabricator: remove phab role from iridium, make it a spare [puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) [21:26:18] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [21:28:14] (03CR) 10Dzahn: [C: 032] site/phabricator: remove phab role from iridium, make it a spare [puppet] - 10https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [21:28:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [21:28:38] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [21:28:38] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [21:28:39] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [21:28:48] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] [21:29:32] hmm, thats a different one [21:29:38] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [21:30:35] 10Operations, 10MediaWiki-JobRunner, 10MediaWiki-Platform-Team, 10monitoring, and 2 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3502395 (10greg) Adding #mediawiki-platform-team as @aaron is maintainer of the jobqueue (per https://www.m... 
[21:31:08] (CR) Dzahn: "this changed exim config to default, removed diamond collector, resolv.conf doesn't have codfw.wmnet, sshd_config, removed deploy key, f" [puppet] - https://gerrit.wikimedia.org/r/370122 (https://phabricator.wikimedia.org/T163938) (owner: Dzahn)
[21:31:53] Operations, hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3502398 (Dzahn)
[21:32:33] Operations, hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499956 (Dzahn) please keep stalled for a few more days like this and don't shut down. we are a spare but don't want the disk wiped just yet, and phab-admins still have shell.. just in case
[21:32:37] seems to be settling back down on its own
[21:32:58] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0]
[21:33:41] ^ is just delayed with a moving average
[21:33:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[21:36:29] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:36:38] ^ is that all related ebernhardson
[21:36:53] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 306 bytes in 0.001 second response time
[21:37:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[21:37:14] chasemp: yes, it's recovering
[21:37:30] cool tx
[21:37:53] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time
[21:38:15] (PS1) Dzahn: admins/dzahn: update to my .bash_profile [puppet] - https://gerrit.wikimedia.org/r/370284
[21:39:02] ugh re: thumbor, i don't have ssh access now tho
[21:39:09] (CR) Dzahn: [C: 2] admins/dzahn: update to my .bash_profile [puppet] - https://gerrit.wikimedia.org/r/370284 (owner: Dzahn)
[21:39:24] seems it's ok
[21:39:51] yeah recovered right away but i thought that symptom was fixed
[21:39:59] will take a look tomorrow
[21:40:10] got recovery
[21:40:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:44:18] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0]
[21:44:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[21:49:06] (CR) Mobrovac: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[21:50:33] (CR) jerkins-bot: [V: -1] [WIP] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004 (owner: Mobrovac)
[21:50:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:51:08] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:51:48] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:58:07] (CR) Dzahn: "for the reasons why in our case rsync is not able to keep the permissions right, also see https://phabricator.wikimedia.org/T79786#1831969" [puppet] - https://gerrit.wikimedia.org/r/370179 (owner: Paladox)
[22:00:53] (PS2) Mobrovac: [WIP] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - https://gerrit.wikimedia.org/r/370004
[22:08:58] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[22:10:37] that's a "fake" alert ... it seems one server got stuck, and now it finally unstuck and ran through the 1k queries it had waiting in the thread pool queue, so they all just reported their time and pushed up the total latency
[22:16:10] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3502588 (mobrovac)
[22:16:58] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[22:18:42] (PS3) Rush: openstack: clean up openstack::repo [puppet] - https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494)
[22:18:44] (PS1) Rush: openstack: keystone as module/profile/role for deployments [puppet] - https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494)
[22:19:20] (CR) jerkins-bot: [V: -1] openstack: keystone as module/profile/role for deployments [puppet] - https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:42:09] (PS1) MaxSem: labs: Remove OAuth setting duplicating prod [mediawiki-config] - https://gerrit.wikimedia.org/r/370291
[22:42:11] (PS1) MaxSem: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/370292
[22:42:14] (PS1) MaxSem: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/370293
[22:42:15] (PS1) MaxSem: Flow settings: wmg -> wg migration, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/370294
[22:42:28] get spammed!
[22:43:44] (CR) jerkins-bot: [V: -1] Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/370292 (owner: MaxSem)
[22:43:48] (CR) jerkins-bot: [V: -1] Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/370293 (owner: MaxSem)
[22:46:45] (PS2) MaxSem: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/370292
[22:46:47] (PS2) MaxSem: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/370293
[22:46:50] (PS2) MaxSem: Flow settings: wmg -> wg migration, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/370294
[23:04:00] !log phab2001 - changing UID/GID for phd user from 997:997 to 498:498 to make it match phab1001, to fix rsync breaking permissions. (rsync forces --numeric-ids when fetching from an rsyncd configured with chroot=yes). chown -R phw:www-data /srv/repos/
[23:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:57] !log "reserved" UID 498 for phd on https://wikitech.wikimedia.org/wiki/UID | phab2001: find -exec chown to fix all the files , restart cron
[23:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:40] (PS3) Dzahn: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (owner: Paladox)
[23:20:10] (CR) jerkins-bot: [V: -1] phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (owner: Paladox)
[23:20:45] (PS4) Paladox: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (https://phabricator.wikimedia.org/T137928)
[23:20:58] (PS5) Paladox: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (https://phabricator.wikimedia.org/T137928)
[23:21:38] (PS6) Dzahn: phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:22:27] (CR) Dzahn: [C: 2] phabricator: Auto rsync phab1001 to phab2001 (codfw) [puppet] - https://gerrit.wikimedia.org/r/370179 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:22:33] thanks :)
[23:23:53] Notice: /Stage[main]/Profile::Phabricator::Main/Rsync::Quickdatacopy[srv-repos]/Cron[rsync-srv-repos]/ensure: created
[23:25:48] paladox: almost.. but doesn't work yet.. debugging
[23:25:54] oh
[23:26:01] the change is fine.. there is something that stops sync though
[23:26:07] hmm
[23:26:16] tried it manually
[23:26:22] ok
[23:27:13] (PS1) BryanDavis: toolforge: Add qstat-full to bastions [puppet] - https://gerrit.wikimedia.org/r/370298
[23:28:22] (PS2) Krinkle: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/368581 (owner: Urbanecm)
[23:28:24] (CR) Krinkle: [C: 1] Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/368581 (owner: Urbanecm)
[23:28:34] (CR) Krinkle: [C: 1] labs: Remove OAuth setting duplicating prod [mediawiki-config] - https://gerrit.wikimedia.org/r/370291 (owner: MaxSem)
[23:33:43] (PS10) BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825
[23:34:10] (CR) jerkins-bot: [V: -1] logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[23:34:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 28 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:35:17] !log phab2001 - installing various package upgrades, apt-get autoremove old kernel images
[23:35:22] (PS11) BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825
[23:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:30] !log phab2001 rebooting
[23:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 9 probes of 267 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:46:10] (CR) BryanDavis: "SMalyshev: you should put this up for puppetswat" [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[23:47:58] (CR) Smalyshev: "@BryanDavis Guillaume said he'll take care of it on Monday" [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[23:51:35] !log phab2001 - removed outdated /etc/hosts entries, that fixed rsync, syncing /srv/repos/ from phab1001
[23:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
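
Editor's note on the 23:04 log entry: when rsync transfers ownership by numeric ID, a service user whose UID/GID differs between the two hosts ends up owning the wrong files on the destination, which is why phd was renumbered from 997:997 to 498:498 on phab2001. The sketch below is only an illustration of that check and was not part of the migration: it compares passwd-format entries from two hosts and reports users whose IDs differ. The function names and the sample passwd lines (home directory and shell fields) are hypothetical; only the user name and the numeric IDs come from the log.

```python
def parse_passwd(text):
    """Return {user: (uid, gid)} parsed from passwd-format text."""
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def id_mismatches(source_passwd, dest_passwd, users):
    """Return {user: (source_ids, dest_ids)} for users whose UID/GID differ.

    A user missing on one side shows up with None for that side.
    """
    src = parse_passwd(source_passwd)
    dst = parse_passwd(dest_passwd)
    return {
        user: (src.get(user), dst.get(user))
        for user in users
        if src.get(user) != dst.get(user)
    }


# Hypothetical passwd lines; only "phd" and the numeric IDs are from the log.
PHAB1001 = "phd:x:498:498::/nonexistent:/bin/false\n"
PHAB2001_BEFORE = "phd:x:997:997::/nonexistent:/bin/false\n"
PHAB2001_AFTER = "phd:x:498:498::/nonexistent:/bin/false\n"

print(id_mismatches(PHAB1001, PHAB2001_BEFORE, ["phd"]))
print(id_mismatches(PHAB1001, PHAB2001_AFTER, ["phd"]))
```

In practice the passwd text could be gathered from each host (for example with `getent passwd`) before enabling a recurring sync; an empty result means numeric ownership will carry over as intended.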