[00:00:04] yurik: Respected human, time to deploy Graphoid service deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T0000). Please do the needful.
[00:00:10] !lol deploying rand();
[00:00:27] (PS3) Dzahn: labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 (https://phabricator.wikimedia.org/T116332)
[00:01:25] (CR) Dzahn: [C: 2] labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:02:44] yurik: sorry, I'm overflowing the SWAT window
[00:02:59] tgr, no worries, take your time, ping me when done
[00:06:33] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760278 (Dzahn) after this. on labvirt1001, the plugin got created: Notice: /Stage[main]/Openstack::Nova::Compute/File[/usr/local/lib/nagios/plugi...
[00:09:56] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[00:10:14] yep, i'll check that
[00:10:20] about to upload a change anyways
[00:11:18] (PS1) Dzahn: labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332)
[00:11:22] (CR) jenkins-bot: [V: -1] labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:11:35] (PS2) Dzahn: labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332)
[00:12:11] (CR) Dzahn: [C: 2] labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:15:34] yep, gonna be fixed after next puppet run
[00:20:46] PROBLEM - HHVM rendering on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:16] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:23:36] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:24:07] PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:07] PROBLEM - puppet last run on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:19] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1095/" [puppet] - https://gerrit.wikimedia.org/r/249038 (owner: Dzahn)
[00:24:27] (PS3) Dzahn: mariadb: 32 lint fixes [puppet] - https://gerrit.wikimedia.org/r/249038
[00:24:36] PROBLEM - nutcracker port on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:47] PROBLEM - configured eth on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:47] PROBLEM - RAID on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:56] PROBLEM - SSH on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:24:58] PROBLEM - Check size of conntrack table on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:03] tgr, still deploying?
[00:25:08] PROBLEM - salt-minion processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:09] PROBLEM - dhclient process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:09] PROBLEM - HHVM processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:37] PROBLEM - nutcracker process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:38] PROBLEM - Disk space on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:57] yurik, according to `w`, yes
[00:26:15] Krenair, `w` ?
[00:26:20] the w command
[00:26:21] on tin
[00:26:33] shows you who is logged in and what they are doing
[00:26:44] ah, good, going to check it out ))
[00:27:03] safe to run it i presume )
[00:27:49] yes
[00:27:51] ooo, such a sweet command, thx ))
[00:27:56] !log powercycling unresponsive mw1127
[00:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:00] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:10] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:39] PROBLEM - Host mw1127 is DOWN: PING CRITICAL - Packet loss = 100%
[00:29:39] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:49] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:51] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:58] hmmmmmm
[00:30:00] RECOVERY - nutcracker process on mw1127 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:30:09] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:30:10] RECOVERY - Host mw1127 is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[00:30:19] mw1135 as well now?
[00:30:19] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[00:30:29] RECOVERY - HHVM processes on mw1127 is OK: PROCS OK: 6 processes with command name hhvm
[00:30:30] RECOVERY - salt-minion processes on mw1127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:30:39] RECOVERY - nutcracker port on mw1127 is OK: TCP OK - 0.000 second response time on port 11212
[00:30:40] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:30:56] operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1760346 (bd808) >>! In T80395#1760088, @Dzahn wrote: > @demon @bd808 what should we do with this ticket? reject? resolve once it uses etcd? I'd vote for closing it when {T115899} or something similar is done. I'd gues...
[00:30:59] RECOVERY - RAID on mw1127 is OK: OK: no RAID installed
[00:30:59] RECOVERY - configured eth on mw1127 is OK: OK - interfaces up
[00:30:59] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:31:09] RECOVERY - SSH on mw1127 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:31:15] ori: yes, one seemed normal, 2 starts to be suspicious
[00:31:19] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[00:31:29] RECOVERY - dhclient process on mw1127 is OK: PROCS OK: 0 processes with command name dhclient
[00:31:29] RECOVERY - Check size of conntrack table on mw1127 is OK: OK: nf_conntrack is 4 % full
[00:31:35] well, there's the icinga config fix too
[00:31:39] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:31:40] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.067 second response time
[00:32:00] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:32:10] RECOVERY - DPKG on mw1127 is OK: All packages OK
[00:32:20] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:32:20] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 64983 bytes in 1.778 second response time
[00:32:28] output of mw1135 console:
[00:32:29] RECOVERY - Disk space on mw1127 is OK: DISK OK
[00:32:42] init: ssh main p
[00:32:43] Ubuntu 14.04.2 LTS mw1135 ttyS1
[00:32:49] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:33:03] OOM killer
[00:33:13] [27867198.620200] Out of memory: Kill process 9213 (hhvm) score 880 or sacrifice child
[00:33:16] [27867198.627855] Killed process 9229 (hhvm) total-vm:346788kB, anon-rss:33344kB, file-rss:196kB
[00:33:19] ori: ^
[00:33:32] * ori looks at app server memory usage on ganglia
[00:33:43] 'or sacrifice child' o.O
[00:33:43] yurik: yes, waiting for scap
[00:33:46] i see that without logging in, on the login screen
[00:34:00] RECOVERY - Disk space on mw1135 is OK: DISK OK
[00:34:10] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:23] predicts recoveries
[00:34:30] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm
[00:34:30] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:34:50] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up
[00:35:05] nothing out of the ordinary in ganglia
[00:35:09] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212
[00:35:09] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient
[00:35:21] !log mw1135 temp. unresponsive - OOM killer killing hhvm
[00:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:30] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:35:36] thanks mutante
[00:35:59] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[00:36:00] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full
[00:36:09] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed
[00:36:21] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:36:42] andrewbogott: fixed. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm+ssl
[00:37:01] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760353 (Dzahn) works now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm+ssl
[00:37:04] !log tgr@tin Finished scap: Updating MediaViewer with r246112 (duration: 43m 18s)
[00:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:21] operations, HTTPS, Icinga, Monitoring: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1760355 (Dzahn)
[00:37:22] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760354 (Dzahn) Open>Resolved
[00:37:49] yurik: done, sorry for the wait
[00:37:57] tgr, no worries )
[00:38:39] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.159 second response time
[00:39:00] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 64974 bytes in 1.325 second response time
[00:39:12] operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1760356 (Dzahn) Sounds good, thank you @bd808. And i see that is already added as a blocker. Great.
[00:43:59] (PS3) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:44:31] (CR) jenkins-bot: [V: -1] lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[00:45:03] (PS4) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:46:17] (PS5) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:49:15] (PS6) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:49:57] (CR) Dzahn: "made this smaller for easier review - these are all just to get to re-enable that check. not too many to fix globally" [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[00:51:24] (PS2) Dzahn: lint: double quoted strings pt.2 [puppet] - https://gerrit.wikimedia.org/r/243853
[00:51:58] (CR) jenkins-bot: [V: -1] lint: double quoted strings pt.2 [puppet] - https://gerrit.wikimedia.org/r/243853 (owner: Dzahn)
[00:53:12] (PS2) Dzahn: labs restbase: Add en.wikivoyage.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/249319 (owner: Alex Monk)
[00:54:20] (CR) Dzahn: "uhm.. "No file(s) found for import of '../../../manifests/nagios.pp" ??" [puppet] - https://gerrit.wikimedia.org/r/243853 (owner: Dzahn)
[00:55:18] (CR) Dzahn: [C: 2] labs restbase: Add en.wikivoyage.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/249319 (owner: Alex Monk)
[00:55:25] !log deployed graphoid - https://gerrit.wikimedia.org/r/#/c/249324/
[00:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:07:05] (CR) Hydriz: [C: 1] copy pagecounts-al-sites files over to labs from datasets [puppet] - https://gerrit.wikimedia.org/r/249175 (https://phabricator.wikimedia.org/T93317) (owner: ArielGlenn)
[02:34:31] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 33s)
[02:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:23] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-28 02:39:23+00:00
[02:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:00:34] !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 06m 00s)
[03:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:34] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-10-28 03:03:34+00:00
[03:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:35] (PS1) Dzahn: openstack: add links to docs for components, lint [puppet] - https://gerrit.wikimedia.org/r/249342
[03:53:18] (PS2) Dzahn: openstack: add links to docs for components, lint [puppet] - https://gerrit.wikimedia.org/r/249342
[04:01:36] (CR) Dzahn: "done here, to fix compiler run:" [puppet] - https://gerrit.wikimedia.org/r/247217 (owner: Muehlenhoff)
[04:04:40] (CR) Dzahn: [C: 1] "rebuilt - compiles as noop now" [puppet] - https://gerrit.wikimedia.org/r/247217 (owner: Muehlenhoff)
[04:14:11] (PS1) Dzahn: interface: some lint fixes [puppet] - https://gerrit.wikimedia.org/r/249344
[04:15:36] (CR) Dzahn: "@ori feel like taking this? this is for mediawiki module what we did for most (all?) misc services already" [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[04:26:14] (PS1) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:32:32] (PS2) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:33:04] (PS3) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:43:34] (PS1) Dzahn: kill misc/fundraising.pp, move to role logging [puppet] - https://gerrit.wikimedia.org/r/249347
[04:44:39] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:46:30] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:08] (PS2) Dzahn: kill misc/fundraising.pp, move to role logging [puppet] - https://gerrit.wikimedia.org/r/249347
[04:48:10] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[04:48:19] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:19] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[04:49:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[04:53:52] operations, Analytics-Backlog, Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1760874 (Dzahn) yep, you are right. i don't know why i thought it was a duplicate, it's not.
[04:56:50] operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760878 (mmodell) Open>declined a:mmodell fatalmonitor is a very rudimentary tool. #scap3 should include such information (See {T110068}), but I don't think fatalmonitor is a good place to add such fu...
[04:59:10] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:24] Ops-Access-Requests, operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (Dzahn) NEW
[05:02:35] Ops-Access-Requests, operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760895 (Dzahn)
[05:02:49] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:02:49] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[05:02:50] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:02:50] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:10] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:09:59] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:09:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:11:13] operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760913 (Legoktm) declined>Open Until scap3 is actually in use, this is a valid feature enhancement request for fatalmonitor. If/once we're no longer using fatalmonitor, this task can be declined.
[05:11:30] operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760915 (Legoktm) a:mmodell>None
[05:15:14] !log ran mwscript updateSpecialPages.php --wiki=testwiki --only=GadgetUsage on terbium
[05:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:15:20] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:15:20] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:22:29] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:22:30] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[05:22:30] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:22:30] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[05:22:31] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[05:27:59] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:00] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:28:00] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:28:00] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:29:49] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [05:31:40] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:35:19] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:36:59] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [05:37:00] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [05:38:49] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [05:38:49] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [05:44:10] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:44:10] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:44:10] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:45:50] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [05:47:49] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:51:19] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [05:51:19] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [05:51:19] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [05:51:20] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:20] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [05:56:40] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:56:40] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:56:41] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:56:41] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:58:29] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [05:58:29] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [06:04:00] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [06:04:00] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [06:04:09] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [06:04:09] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:04:09] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:09:29] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [06:09:29] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [06:09:29] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [06:09:29] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [06:09:30] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:11:19] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [06:14:59] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:14:59] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:14:59] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:14:59] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:40] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [06:16:42] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [06:16:42] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [06:16:42] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 28 06:22:06 UTC 2015 (duration 22m 5s) [06:22:09] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:09] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:09] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:22:09] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:25:31] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [06:25:31] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [06:25:31] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [06:25:39] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [06:25:39] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [06:25:39] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [06:28:58] PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [06:29:30] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail [06:30:30] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:40] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:41] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:59] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:30] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:41] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:09] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:10] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:40] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: 
Puppet has 2 failures [06:34:50] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:50] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:50] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:50] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:50] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:35:01] RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0] [06:43:41] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [06:43:41] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [06:43:49] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:30] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [06:47:19] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [06:47:19] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [06:49:09] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:51] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:41] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:56:40] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:49] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:57:20] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:31] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:01] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:01] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:58:01] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:11] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:59:51] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:00:32] operations, Analytics-Backlog, Datasets-General-or-Unknown, Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1761030 (Addshore)
[07:02:17] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1761034 (Addshore)
[07:02:36] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1758419 (Addshore) Amended per @Krenair
[07:03:31] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[07:05:20] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[07:08:59] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:09:00] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[07:13:00] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[07:14:29] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:04] 6operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1761047 (10Joe) Hi, I don't think we really need 6 videoscalers, or at least I don't see a compelling reason for that given: 1) The current videoscalers are already... [07:16:10] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:49] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:21:01] 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1761048 (10mmodell) @legoktm: scap3 is in use and T110068 should be done in the near future. Have you looked at the code for fatalmonitor? It's a series of unix commands piped together and wrapped in `watch` -... [07:21:30] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [07:21:30] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [07:21:30] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [07:21:30] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [07:21:30] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [07:24:55] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment: move to role module [puppet] - 10https://gerrit.wikimedia.org/r/249090 (owner: 10Giuseppe Lavagetto) [07:26:59] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:59] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:59] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:59] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:31:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1) We did load the apache 2.4 module that allow using 2.2 syntax so this just adds complexity with no gain, AFAICT" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [07:31:50] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures [07:32:30] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:32:30] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:33:45] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::server: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/249091 (owner: 10Giuseppe Lavagetto) [07:38:30] (03CR) 10Giuseppe Lavagetto: [C: 032] "noop according to the puppet compiler" [puppet] - 10https://gerrit.wikimedia.org/r/249092 (owner: 10Giuseppe Lavagetto) [07:40:00] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:42:34] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::mediawiki: rename wikitech::wiki::password class [puppet] - 10https://gerrit.wikimedia.org/r/249093 (owner: 10Giuseppe Lavagetto) [07:44:59] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [07:44:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [07:44:59] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [07:44:59] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [07:44:59] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [07:45:00] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [07:45:55] (03CR) 10Giuseppe Lavagetto: [C: 032] 
role::deployment: remove test role [puppet] - 10https://gerrit.wikimedia.org/r/249094 (owner: 10Giuseppe Lavagetto) [07:46:16] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::server: drop mod_dav [puppet] - 10https://gerrit.wikimedia.org/r/249102 (owner: 10Giuseppe Lavagetto) [07:54:20] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [07:54:50] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [07:55:41] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:50] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:51] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:04:51] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:39] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:29] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [08:08:30] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [08:11:29] 6operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1761083 (10brion) @Joe I'm planning to re-run all the Ogg transcodes for improved quality and to fix a bunch of old ones that broke; this will eat all the CPU time for... [08:12:00] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [08:13:50] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:50] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:17:29] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:09] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [08:19:10] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [08:19:10] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [08:19:10] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [08:24:30] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:31] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:31] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:31] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:28:09] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [08:28:09] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [08:29:59] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [08:31:59] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [08:33:41] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:33:41] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:35:29] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [08:35:29] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [08:35:30] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [08:40:50] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:50] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:50] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:50] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:50] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:40:55] (03PS2) 10Giuseppe Lavagetto: r::mw::maintenance: include role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/249108 (https://phabricator.wikimedia.org/T116728) [08:43:22] (03CR) 10Giuseppe Lavagetto: [C: 032] r::mw::maintenance: include role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/249108 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [08:44:29] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:46:09] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [08:46:09] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [08:46:09] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [08:46:10] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [08:46:10] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [08:46:10] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [08:46:56] (03PS2) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 [08:51:00] 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1761129 (10MoritzMuehlenhoff) I had checked the status of a few long-time depooled mw* servers (and re-enabled a few) when I made the flip to enable ferm. From my IRC logs Ori confirmed that mw1169 can be re-poo... [08:51:31] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:31] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:51] _joe_: ^ can you amend T116256 with the status for mw1161? [08:52:24] <_joe_> moritzm: uh? I have no idea I guess? [08:53:05] <_joe_> if there is an hardware ticket I opened, I guess the status is what chris stated there [08:53:15] ah, sorry. I thought "retroactive commit by joe" referred to your own changes [08:53:21] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:21] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:21] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:53:21] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:22] <_joe_> nope [08:53:33] <_joe_> I found it depooled with no commit, probably [08:54:14] (03PS3) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 [08:54:15] <_joe_> but I can take a look, yes [08:55:03] there's already a ticket by Rob for that, it's probably just a case of someone forgot to re-pool it [08:55:09] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [08:55:09] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [08:55:10] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [08:55:14] can't find a Phab ticket for hardware problems [08:55:26] <_joe_> I tend to err on the side of caution [08:55:44] <_joe_> btw, rob has also a ticket to have 6 videoscalers, which I just shot down [08:56:06] <_joe_> well, brion created it, and I think it is a bit of an overkill [08:56:40] thanks thanks thanks [08:56:46] indeed [08:56:49] I thought I was going crazy [08:57:21] <_joe_> we can go with 3 as we are now (1.5x what we usually had) or with 4 (2x) if we really need it [08:58:14] <_joe_> jynus: regarding what? 
[08:58:49] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [08:58:49] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [08:58:51] (03PS4) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 [08:59:49] sorry, I got too excited: post-commit hook on palladium not updating strontium adequately [08:59:55] giving random puppet errors [09:00:22] I will create a ticket, but at least I identified it [09:00:38] which is good because it is actionable [09:00:40] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:40] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:52] <_joe_> jynus: sometimes that has happened in the past - we failed to fix it though [09:01:06] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 (owner: 10Giuseppe Lavagetto) [09:01:10] it is ok, I can puppet-merge on strontium [09:01:39] but I was getting confused because the simplest of changes failed [09:02:09] as I said, knowing the why is 90% of the problem solving [09:02:20] documenting a problem is an ok solution [09:02:30] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [09:02:30] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [09:02:52] what I feel is that after so many months I continue discovering things [09:03:49] _joe_: so do I have to start the batch recompression first to demonstrate need for CPU time or something? [09:04:19] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:04:19] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:19] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:51] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:51] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:08:29] Is there a reason to have the mariadb module on its own repo instead of operations/puppet? [09:09:18] the same way I do not think there is a reason to have wikimedia-mariadb deb repo [09:09:43] <_joe_> jynus: you might want to ask ottomata about the separate git repo [09:09:50] ok, thanks [09:10:14] but that is strange- he only recently started using that class [09:11:10] also, it should not be named mariadb, I suppose it only had that name to differentiate it from the original class [09:11:16] <_joe_> jynus: the reason might be other projects used that puppet class [09:11:29] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:12:08] but it still gets merged into the mariadb module on the same puppet place :-/ [09:12:45] please, ignore me _joe_ and keep doing the great work you are doing, do not let me bother you [09:12:57] :-) [09:15:41] <_joe_> !log preparing to reimage mw1152, disabling puppet, scheduling downtime. [09:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:16:50] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:39] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [09:33:20] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:33:20] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [09:33:20] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [09:33:20] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:33:21] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [09:38:51] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:51] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:02] (03PS2) 10Filippo Giunchedi: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [09:43:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [09:43:28] mobrovac, godog: shall we start? I'll disable puppet on restbase100[1-6] [09:43:42] moritzm: +1 [09:43:51] kk [09:43:56] moritzm: let me force a run first actually [09:44:19] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [09:44:20] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:44:21] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:44:34] godog: ok, I'll reenable puppet then [09:44:58] done [09:46:30] moritzm: ok! good to go [09:47:55] wrt "disable Cassandra and RESTBase on boot on all boxes", do they use systemd units? is that a simple "systemctl disable foo.unit"? 
[09:48:09] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [09:48:52] moritzm: yep all systemd [09:49:10] moritzm: actually no, cassandra is systemd but restbase isn't [09:49:59] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:59] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:40] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:53:40] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:40] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:55:29] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [09:55:30] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:56:44] I've silenced those btw [09:58:37] godog: moritzm: rb's got a sysV script, but it's managed with systemd [09:59:02] also, please depool each server before rebooting [09:59:19] we already have a high enough 5xx rate as it is [09:59:31] godog: that rb1007 seems to be really problematic [10:00:25] what do you mean by "managed with systemd", there's no unit file for it? 
[10:00:52] no [10:01:00] systemd is using the initV script directly [10:03:42] (03PS1) 10Jcrespo: mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/249365 [10:05:24] (03Abandoned) 10Jcrespo: mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/249365 (owner: 10Jcrespo) [10:06:20] (03PS2) 10Giuseppe Lavagetto: mw1152: convert to be the HAT maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/249110 (https://phabricator.wikimedia.org/T116728) [10:06:22] (03PS1) 10Mobrovac: Labs: Parsoid Cache: Use new IP address for deployment-parsoidcache02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) [10:06:43] (03CR) 10Giuseppe Lavagetto: [C: 032] mw1152: convert to be the HAT maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/249110 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto) [10:08:34] mobrovac: ok, restbase startup also disabled [10:08:39] 6operations, 6Analytics-Engineering, 10Beta-Cluster-Infrastructure, 7Varnish: On beta cluster varnish stats process points to production statsd - https://phabricator.wikimedia.org/T116898#1761231 (10hashar) 3NEW [10:09:30] kk moritzm [10:10:09] mobrovac, godog: any special order needed/preferred, or are they all alike?
[10:10:22] all the same [10:10:33] I'll start by depooling 1001, then [10:11:09] <_joe_> !log reimaging mw1152 [10:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:27] (03PS1) 10Jcrespo: Merge changes in the mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/249369 [10:15:15] (03CR) 10Jcrespo: [C: 032] Merge changes in the mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/249369 (owner: 10Jcrespo) [10:17:51] 6operations, 10RESTBase-Cassandra: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761257 (10fgiunchedi) judging from strace timing it seems mobileapps endpoint is slow, in the ~2s range and sometimes takes more than the `service_checker` timeout to reply (5s). running... [10:19:28] mobrovac: ^ thoughts? (or anyone else really) [10:20:48] godog: inexplicably, this seems to correlate with rb1007 having latency problems [10:21:00] but, one node shouldn't matter [10:21:26] godog: rb1001 would be ready to boot into the new kernel, you keeping an eye on the effect of https://gerrit.wikimedia.org/r/#/c/248960/ ? [10:21:36] moritzm: yup [10:21:55] k, rebooting [10:22:11] mobrovac: true it shouldn't, when we are done with the reboots we can try taking 1007 out of the cluster and see if that changes things [10:22:39] good idea godog! 
[10:22:49] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Timeout while attempting connection [10:22:55] (03PS2) 10Hashar: beta: use new IP for Parsoid Cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac) [10:23:50] (03CR) 10Hashar: [C: 032] "I fixed up the comments to refer to deployment-cache-parsoid05" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac) [10:23:56] (03Merged) 10jenkins-bot: beta: use new IP for Parsoid Cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac) [10:28:58] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: better cassandra process checks - https://phabricator.wikimedia.org/T108306#1761279 (10fgiunchedi) 5Open>3Resolved resolved by https://gerrit.wikimedia.org/r/#/c/249082/ [10:29:26] (03Abandoned) 10Filippo Giunchedi: cassandra: switch to nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/230066 (https://phabricator.wikimedia.org/T108306) (owner: 10Filippo Giunchedi) [10:37:13] <_joe_> !log manually removed crontab from mw1152, erroneously created by puppet [10:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:39:10] !log updated kernel on restbase1001 to latest 3.19 [10:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:48] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[10:45:47] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [10:49:19] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [10:50:40] (03PS1) 10Jcrespo: Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) [10:56:14] (03PS1) 10Filippo Giunchedi: cassandra: don't enable at boot [puppet] - 10https://gerrit.wikimedia.org/r/249374 [11:01:27] (03PS1) 10KartikMistry: Apertium: Add missing apertium-br-fr [puppet] - 10https://gerrit.wikimedia.org/r/249376 (https://phabricator.wikimedia.org/T102101) [11:01:42] (03PS2) 10Jcrespo: Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) [11:02:12] akosiaris: ^^small review for you :) [11:03:19] kart_: today's a public holiday in Greece [11:03:24] oh. [11:03:32] not urgent. [11:04:01] (03CR) 10Jcrespo: [C: 032] Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) (owner: 10Jcrespo) [11:06:30] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [11:08:25] (03CR) 10Mobrovac: [C: 031] cassandra: don't enable at boot [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:12:34] (03CR) 10Filippo Giunchedi: [C: 04-1] "sadly this doesn't seem to work when tested on in labs, puppet still wants to ensure the service is running, testing some more.." [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:13:23] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1761393 (10SBisson) What is blocking this ticket? 
[11:15:36] (03PS1) 10Giuseppe Lavagetto: maintenance: amend hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/249377 [11:16:13] <_joe_> godog: I think we had some logic for that [11:16:19] <_joe_> in base::service_unit [11:16:48] <_joe_> but the problem is that puppet either manages the state of a service, or cannot think it is present. [11:18:05] (03PS14) 10Madhuvishy: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [11:18:07] (03CR) 10Madhuvishy: "Thanks for the review, Ori!" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [11:18:19] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 4 failures [11:19:05] (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: amend hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/249377 (owner: 10Giuseppe Lavagetto) [11:19:47] (03PS1) 10Joal: Correct pageview to dumps synchro [puppet] - 10https://gerrit.wikimedia.org/r/249378 [11:19:57] _joe_: indeed, so in this case it'd be declare_service => false to have the service available but not (re)started by puppet nor at boot [11:20:14] <_joe_> yep [11:21:38] I'll fix the docs for service_unit too [11:21:51] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:22:01] <_joe_> godog: can you verify that works? [11:23:22] I can [11:24:21] (03CR) 10Faidon Liambotis: [C: 04-1] "This has been on my radar too. 
AFAIK, all of that (and all of the remaining nfs.pp) is about to be deprecated *very soon* now, as in this " [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn) [11:28:50] (03PS2) 10Filippo Giunchedi: cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 [11:29:46] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761440 (10hashar) Holy hell, how do you manage to install servers so fast ? :-} [11:31:45] 6operations, 7Database: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10jcrespo) 3NEW a:3jcrespo [11:33:27] moritzm mobrovac https://gerrit.wikimedia.org/r/#/c/249374/2 this should do it for cassandra [11:34:43] godog: haven't we established ensure => present is not valid? [11:34:56] 6operations, 7Database: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1761462 (10jcrespo) I've filed T116903 and T116902 for the reminding tasks. I have already filed the Puppet-mariadb issues. T105879 is no longer an issue, so I will repool the server now and close thi... [11:35:01] PROBLEM - service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [11:35:15] wtf? 
[11:35:47] godog: cass died on rb1007 ^^ [11:35:58] (03PS1) 10Jcrespo: Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) [11:36:00] PROBLEM - Cassandra CQL query interface on restbase1007 is CRITICAL: Connection refused [11:36:22] mobrovac: indeed, likely the same as yesterday [11:38:26] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761469 (10hashar) [11:38:34] (03PS2) 10Jcrespo: Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) [11:39:08] (03CR) 10Jcrespo: [C: 032] Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) (owner: 10Jcrespo) [11:41:09] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761472 (10hashar) Need to get rid of the Gerrit replication ( T86661 ) We will need to make sure all slaves (gallium.wikimedia.org and instances in contintcloud and inte... [11:41:13] mobrovac, ping me when you are less busy [11:41:22] (03CR) 10Muehlenhoff: "I think that could work, but I'm unsure what to expect from systems where puppet currently manages the service, does that have an effect o" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:41:28] heh kk jynus [11:41:33] hope that'll happen today :) [11:41:40] it is about a tin commit [11:42:15] oh, it is beta, so probably doesn't matter [11:42:31] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [11:42:48] jynus: which one?
[11:42:58] beta: use new IP for Parsoid Cache [11:43:07] jynus: oops sorry [11:43:11] I forgot to deploy that one on prod [11:43:15] it is harmless for prod [11:43:16] i've merged it [11:43:19] sorry :-/ [11:43:29] so I suppose I do not need to deploy it :-) [11:43:38] there was no problem at all [11:44:00] (03CR) 10Mobrovac: cassandra: stop declaring service resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:44:25] :) [11:46:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 after maintenance (duration: 00m 19s) [11:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:10] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:52:27] godog: should i restart cass on rb1007? [11:53:53] mobrovac: heh I was looking at the logs but not a lot of luck, sure go ahead [11:54:10] yeah, me too, no luck [11:54:35] godog: i think we'll have to look into it after we finish the kernel upgrade [11:54:44] is it another OOM? [11:55:07] 6operations, 7Database: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1761490 (10jcrespo) 5stalled>3Resolved [11:55:22] it is [11:55:34] mobrovac: *nod* [11:55:38] urandom: still awake?
[11:55:39] urandom: yep :| [11:56:51] (03CR) 10Filippo Giunchedi: "@Moritz, afaict it will stop managing the service from puppet's POV but leave it alone otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:57:10] RECOVERY - service on restbase1007 is OK: OK - cassandra is active [11:57:24] godog, mobrovac: i'm moving that heap dump into a named subdirectory of my home [11:57:35] (03CR) 10Filippo Giunchedi: cassandra: stop declaring service resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [11:57:40] since the PID isn't being expanded, it'll be overwritten if not renamed [11:58:09] urandom: yeah mobrovac pointed me to the phab ticket earlier about %p :| [12:00:01] RECOVERY - Cassandra CQL query interface on restbase1007 is OK: TCP OK - 0.004 second response time on port 9042 [12:00:06] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1761498 (10jcrespo) When setting db1022 on production, we lost some accounts and hosts: ``` mysql> SHOW GLOBAL STATUS like 'performance%'; +---------------------------------------... 
[12:02:41] (03CR) 10Muehlenhoff: [C: 031] "As discussed on IRC, let's try this on one of the restbase hosts with puppet disabled on the others" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [12:03:26] (03PS3) 10Filippo Giunchedi: cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 [12:03:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [12:03:44] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1761502 (10jcrespo) By the way, this is a good guide to tune performance_schema: http://marcalff.blogspot.com.es/2013/04/on-configuring-performance-schema.html [12:06:41] (03CR) 10Filippo Giunchedi: "will need to change references to Service['cassandra'] too" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi) [12:06:43] there will be some puppet failures coming up for restbase machines btw [12:08:20] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail [12:10:00] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: puppet fail [12:11:31] ok I'm going to revert https://gerrit.wikimedia.org/r/249374 [12:13:21] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: puppet fail [12:13:47] (03PS1) 10Filippo Giunchedi: Revert "cassandra: stop declaring service resource" [puppet] - 10https://gerrit.wikimedia.org/r/249382 [12:13:51] it seems we are going to spend the day on this :) [12:14:19] "just a simple kernel upgrade" <- famous last words of the day [12:14:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "cassandra: stop declaring service resource" [puppet] - 10https://gerrit.wikimedia.org/r/249382 (owner: 10Filippo Giunchedi) [12:14:40] the punch is that the kernel upgrade itself is fine [12:15:30] PROBLEM - puppet last run 
on aqs1003 is CRITICAL: CRITICAL: puppet fail [12:18:59] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: puppet fail [12:19:26] mobrovac: hehe, anyways I think we should move on with the kernel upgrade at least, keep puppet disabled on the affected machines and then tackle the service management [12:19:37] cc moritzm ^ [12:19:40] (03PS1) 10Jcrespo: Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 [12:20:17] (03CR) 10Alex Monk: "tin and terbium use apache 2.2.22... I don't think they run MW though. tin has some sort of deployment stuff (trebuchet?) and terbium runs" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [12:20:22] godog: moritzm: sounds good to me [12:23:06] there seems to be some issue with puppet on stat* I will check it after lunch [12:24:46] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1761540 (10fgiunchedi) afaict our puppet hooks for jessie does include `thirdparty` ``` package_builder::pbuilder... [12:25:29] PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: puppet fail [12:26:10] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:27:19] RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:28:30] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:29:12] godog: ok, but puppet was removed on rb1001 already? [12:29:22] disabled, not removed [12:30:01] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:30:11] I'll repool rb1001 now if we don't need it for further debugging [12:31:12] moritzm: true I was expecting it to be disabled on 1001 [12:32:11] godog: ok, so you mean to proceed with 100[2-6] despite the fact that cassandra/rb get restarted upon reboot (which worked fine for 1001 anyway)? [12:33:04] moritzm: yep, even though with puppet disabled and systemctl disable cassandra it shouldn't really start anything, btw puppet does seem to be enabled atm on 1002 for example [12:35:39] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [12:36:52] (03PS1) 10Giuseppe Lavagetto: service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) [12:37:50] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:38:55] <_joe_> godog: ^^ [12:39:34] godog: ok, I had it disabled via salt initially (and also got the "True" output from salt) (before it was re-enabled for your final puppet run).
I just made a second run to disable it, but got not output from the salt run, I'll check that locally on the systems [12:39:52] ok, but I'll proceed with 100[2-6], then [12:40:01] (03PS15) 10Madhuvishy: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [12:40:47] (03CR) 10Filippo Giunchedi: [C: 031] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto) [12:40:56] moritzm: ack [12:41:20] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:43:07] (03PS2) 10Giuseppe Lavagetto: service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) [12:43:20] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:46:27] (03PS1) 10Hashar: contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) [12:46:50] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:47:10] (03CR) 10Hashar: "Impacts gallium.wikimedia.org . Once applied, the zuul-merger process needs to be restarted." 
[puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [12:47:21] (03CR) 10Mobrovac: [C: 031] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto) [12:47:45] (03CR) 10Giuseppe Lavagetto: [C: 032] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto) [12:48:49] moritzm: i still think it'd be advisable to use systemctl mask [12:54:49] mobrovac, godog: I'm fine either way, we can certainly also use mask [12:54:50] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:56:46] !log depooled restbase1002 for kernel update [12:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:22] could use a config change for Gerrit to stop replication to gallium (the CI host) https://gerrit.wikimedia.org/r/#/c/244498/ [13:00:37] the replication is no more needed and I would like to clean it up from the server to unblock some other task [13:01:01] PROBLEM - Restbase endpoints health on restbase2006 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:01:06] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761603 (10Joe) 5Open>3Resolved a:3Joe [13:01:23] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks 
timing out - https://phabricator.wikimedia.org/T116739#1756819 (10Joe) [13:01:24] 6operations, 10RESTBase-Cassandra, 7Monitoring: service_checker reports success even on endpoints timing out - https://phabricator.wikimedia.org/T116770#1761607 (10Joe) 5Open>3Resolved a:3Joe [13:02:53] (03CR) 10Hashar: "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/1111/gallium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [13:03:11] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:04:03] <_joe_> uhm this ^^ is probably been there all along, but we will notice now that I corrected the bug. 
[13:04:59] PROBLEM - Restbase endpoints health on restbase2003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:05:07] yep very likely [13:05:27] <_joe_> I'm going to debug it for a second [13:05:33] I have to run to lunch, moritzm mobrovac page if you need anything [13:06:09] PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1: [13:06:30] PROBLEM - Restbase endpoints health on restbase2004 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:06:43] hm so content-encodig [13:06:46] interesting [13:06:59] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url 
http://127.0.0.1:7231/en.wik [13:06:59] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wiki [13:07:03] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1761630 (10Andrew) thanks for fixing, sorry my patch was dumb :( [13:07:17] _joe_: could you ack all those for RB? [13:07:36] <_joe_> mobrovac: yeah but this is a real problem [13:07:46] <_joe_> mobrovac: rb responds with content-encoding: gzip [13:07:53] <_joe_> but then has uncompressed output [13:08:04] _joe_: i agree, we'll look into it [13:08:22] ah i think i know why [13:08:42] <_joe_> I'll open a ticket and ack the alarms [13:08:53] cool, you can assign it to me _joe_ [13:08:56] thnx [13:11:19] PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:12:26] 6operations, 10RESTBase, 6Services: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1761636 (10Joe) 3NEW a:3mobrovac [13:12:44] !log repooled restbase1002 [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:40] !log depooled restbase1003 for kernel update [13:16:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:39] PROBLEM - Restbase endpoints health on restbase2002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:22:21] (03PS1) 10Hashar: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 [13:24:02] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar) [13:24:52] (03CR) 10Hashar: [C: 04-1] "fails on Precise ..." [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar) [13:25:50] (03PS2) 10Hashar: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 [13:26:25] (03CR) 10Milimetric: [C: 031] Correct pageview to dumps synchro [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [13:28:30] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [13:28:36] (03CR) 10Hashar: [C: 031 V: 031] "cherry picked on integration puppetmaster . Pass on all distributions." 
[puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar) [13:34:49] (03PS2) 10Milimetric: Correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [13:34:55] !log repooled restbase1003 [13:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:24] !log depooled restbase1004 for kernel update [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:37] 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1761698 (10hashar) All good on permanent slaves. When https://gerrit.wikimedia.org/r/#/c/244748/ is merged, we ca... [13:47:33] !log repooled restbase1004 [13:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:19] !log depooled restbase1005 for kernel update [13:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:18] 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1761752 (10hashar) [13:57:19] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [13:57:20] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
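Note on the "Restbase endpoints health" alerts: they report a response labelled content-encoding: gzip whose body then fails to decode. A minimal Python sketch of that kind of check follows. It is an illustration only and not the service_checker code discussed in the channel; the function name check_gzip_claim and the plain-dict header input are assumptions made for this example.

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def check_gzip_claim(headers, body):
    """Return (ok, reason) describing whether body matches its Content-Encoding.

    headers: mapping of header name to value (hypothetical input shape).
    body: raw response bytes, before any automatic decompression.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    encoding = lowered.get("content-encoding", "").strip().lower()

    if encoding != "gzip":
        return True, "no gzip claim"

    if not body.startswith(GZIP_MAGIC):
        # The case seen in the alerts: gzip header, uncompressed payload.
        return False, "header says gzip but body lacks gzip magic bytes"

    try:
        gzip.decompress(body)
    except (OSError, EOFError, zlib.error) as exc:
        return False, "gzip body failed to decode: %s" % exc

    return True, "gzip body decodes cleanly"


if __name__ == "__main__":
    print(check_gzip_claim({"Content-Encoding": "gzip"}, b"<html></html>"))
    print(check_gzip_claim({"Content-Encoding": "gzip"}, gzip.compress(b"<html></html>")))
```

The magic-byte test comes first because it is cheap and gives a clearer message than a generic decompression error; the full decode then catches truncated or corrupt streams.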
[13:58:03] * MatmaRex gently pokes godog about https://phabricator.wikimedia.org/T111838 [14:00:10] (03PS2) 10Jcrespo: Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 [14:01:12] oh. Forgot to save deployment page earlier :/ [14:01:28] (03CR) 10Jcrespo: [C: 032] Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 (owner: 10Jcrespo) [14:05:26] (03PS1) 10Giuseppe Lavagetto: role::statistics: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/249396 [14:05:57] !log repooled restbase1005 [14:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:21] (03PS2) 10Giuseppe Lavagetto: role::statistics: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/249396 [14:06:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249396 (owner: 10Giuseppe Lavagetto) [14:08:59] (03PS1) 10Muehlenhoff: Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 [14:09:10] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:09:21] 6operations, 10Continuous-Integration-Infrastructure: Install Jenkins Job Builder on gallium - https://phabricator.wikimedia.org/T45141#1761789 (10hashar) 5Open>3declined a:3hashar For now on we deploy them manually. Maybe we will get that moved to use scap3 and have it generate the jobs directly from t... 
[14:09:32] !log depooled restbase1006 for kernel update [14:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:13] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761796 (10hashar) [14:10:19] MatmaRex: yup I've seen that but not a lot of bandwidth atm [14:11:49] godog: are you really the only person who'd be able to look into that? [14:12:24] (03PS1) 10Subramanya Sastry: WIP: Update parsoid server.js path + removed stale parsoid file [puppet] - 10https://gerrit.wikimedia.org/r/249399 [14:12:30] (03PS1) 10Filippo Giunchedi: cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) [14:12:42] (also, what has higher priority that data loss bugs? :/ i sure hope the whole site not going down is not depending on you propping it up all the time) [14:13:53] MatmaRex: no I'm certainly not the only person [14:18:19] (03PS1) 10Matthias Mullie: Expire Flow caches after 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) [14:19:15] (03CR) 10Matthias Mullie: [C: 04-2] "Do not merge before https://gerrit.wikimedia.org/r/#/c/247575/ has beem in production for awhile & having checked its impact." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie) [14:20:28] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1761834 (10saper) Question: wouldn't that be possible to ship the certificate as a parameter to `$wgForeignXXXRepos` and not... 
[14:21:15] (03PS3) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [14:21:31] !log repooled restbase1006 [14:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:49] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:49] (03CR) 10jenkins-bot: [V: 04-1] Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [14:22:37] !log T112626 Finished running fix-stats.php for CX (from rwwiki to zuwiki) [14:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:10] (03PS4) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [14:24:00] (03CR) 10Mobrovac: [C: 04-1] "files/misc/parsoid seems to be used by File['/usr/bin/parsoid'] in the role::parsoid::common class (manifests/role/parsoid.pp:22)" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [15:38:38] PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:39] (03CR) 10Hashar: [C: 031] "That is needed to setup zuul-merger on scandium. 
Else the patch it merges will be reported as being on zuul.eqiad.wmnet which is gallium " [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:38:42] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:42] (03Draft1) 10Hashar: (WIP) contint: scandium configuration (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/249380 [15:38:43] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:43] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:43] Hm. If we needed confirmation that NFS issues would alert noisily... [15:38:44] (03CR) 10Hashar: "This will get us access to scandium for the preliminary setup. It grants me root access like on gallium, will probably want to remove it o" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (owner: 10Hashar) [15:38:45] PROBLEM - Cassandra CQL query interface on restbase-test2003 is CRITICAL: Connection refused [15:38:45] PROBLEM - Restbase root url on restbase-test2003 is CRITICAL: Connection refused [15:38:45] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:38:46] ACKNOWLEDGEMENT - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% Coren Switch in progress [15:38:46] (03CR) 10Andrew Bogott: "This presumes that the merger is always running on the same host as Zuul -- is that a safe assumption? 
Otherwise we could use a hiera set" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:38:46] (03PS2) 10Subramanya Sastry: WIP: Update parsoid server.js path [puppet] - 10https://gerrit.wikimedia.org/r/249399 [15:38:46] (03CR) 10Subramanya Sastry: "That looks like a stale reference. But, I'll deal with that stale file and references in a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [15:38:46] (03PS1) 10coren: Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) [15:38:46] ebernhardson, legoktm: I plan to enable ferm/firewall on tin before morning swat (for which you two are listed ATM). I don't expect any interference from the change, but if what you're planning to deploy is too critical I can also defer to another day [15:38:46] (03CR) 10coren: [C: 04-1] "Merge only after succesful switch." [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren) [15:38:47] (03CR) 10Hashar: "That is the over side. This patch stop assuming the merger run on the same host as Zuul scheduler." [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:38:48] (03PS16) 10Ottomata: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [15:38:48] (03CR) 10Ottomata: [C: 032] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [15:38:48] moritzm: worst case mine can go out 8 hours later, its annoying but not the end of the world [15:38:48] ebernhardson: ok, thanks. if anyone should cause problems, I'll be able to spot it quickly in the logs anyway [15:38:48] anything, not anyone... 
[15:38:49] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures [15:38:49] (03PS2) 10Muehlenhoff: Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 [15:38:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 (owner: 10Muehlenhoff) [15:38:53] moritzm: nope, nothing too critical [15:38:55] * aude panics to put things in swat :) [15:38:55] ebernhardson, legoktm: I've enabled it and added logging, ping me when you started and I'll monitor the logs if anything gets dropped [15:38:55] no need to panic :-) [15:38:58] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [15:38:58] legoktm: you wanna deploy this time around? :) [15:38:58] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [15:38:58] ha, sure [15:38:58] ebernhardson: is there any order your patches need to be deployed in? 
[15:38:58] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [15:38:58] legoktm: my two patches are ordered (but not in a way that breaks things). Basically the first one has to go out and settle for a bit such that web pages stop requesting schema.Search from RL immediatly on page load [15:38:58] legoktm: by loading that so early in the process basically resource loader is serving up the schema without having received the code for some % of the time [15:38:58] (because its a reasonable chance of being loaded from a machine that hasn't gotten the sync yet) [15:38:58] so, if you could do my first one, then yours, then my last one, should be enough time [15:38:58] (03PS1) 10Aude: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 [15:38:58] ebernhardson: are you doing swat? 
[15:38:59] aude: lego is [15:38:59] (03CR) 10jenkins-bot: [V: 04-1] Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude) [15:38:59] ah [15:38:59] also working out what the query that needs to be added to fatalmonitor is, is basically just curl + jq [15:38:59] (03PS2) 10Aude: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 [15:39:02] (03CR) 10Andrew Bogott: [V: 032] contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:39:02] (03PS2) 10Andrew Bogott: contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:39:02] (03PS2) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 [15:39:02] (03CR) 10Andrew Bogott: [C: 032] contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:39:02] (03PS3) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) [15:39:03] (03CR) 10Hashar: "I have added reference to Bug: T95046 and filled T116921 to remember to remove the root access." 
[puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:39:05] !log legoktm@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.search.js: https://gerrit.wikimedia.org/r/#/c/249405/ (duration: 00m 18s) [15:39:05] ebernhardson: ^ doing wmf4 now [15:39:05] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail [15:39:05] legoktm: i'd like https://gerrit.wikimedia.org/r/#/c/249412/ in swat [15:39:05] only for test.wikidata for now [15:39:05] ok, I'll do that next while the core patch merges [15:39:05] ok, thanks [15:39:05] (03CR) 10Hashar: "/etc/zuul/zuul-merger.conf" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:39:05] (03PS1) 10Filippo Giunchedi: cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) [15:39:05] !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/modules/ext.wikimediaEvents.search.js: https://gerrit.wikimedia.org/r/#/c/249404/ (duration: 00m 18s) [15:39:05] ebernhardson: ^ [15:39:05] (03CR) 10Legoktm: [C: 032] Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude) [15:39:05] (03Merged) 10jenkins-bot: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude) [15:39:05] ebernhardson, legoktm: no problems from the firewall rules, BTW. 
no traffic not covered by the existing rules gets dropped [15:39:06] legoktm: looks sane, but i basically have to wait for RL to start sending that out (a few minutes) [15:39:06] !log legoktm@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/249412/ (duration: 00m 17s) [15:39:06] looking [15:39:06] aude: ^ [15:39:06] bd808: is there some proxy i can talk to from fluorine to query logstash elasticsearch? [15:39:06] hmm, might not be perfect but not horribly broken or such [15:39:06] * aude investigates and probably has follow up later [15:39:06] ok, wikidata isbroken [15:39:06] can't be related [15:39:06] https://www.wikidata.org/wiki/Special:Random [15:39:06] legoktm: can we revert? [15:39:06] in case it's related [15:39:06] wait what [15:39:06] ok [15:39:06] (Cannot access the database: Unknown database 'wikidatawiki' (10.64.0.19)) [15:39:06] which i can't imagine how my change would cause that [15:39:06] and test.wikidata is ok [15:39:06] (03PS1) 10Legoktm: Revert "Add pageterms mobile api params for test.wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249420 [15:39:06] thanks [15:39:06] (03CR) 10Legoktm: [C: 032 V: 032] Revert "Add pageterms mobile api params for test.wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249420 (owner: 10Legoktm) [15:39:06] oh [15:39:06] yes, my mistake [15:39:06] only affects wikidata [15:39:06] ebernhardson: You should be able to talk to the apache that runs logstash.wm.o from anywhere and run queries that would be possible through Kibana, but I haven't tried to actually do it. [15:39:06] ebernhardson: I cheat and ssh into logstash hosts [15:39:06] !log legoktm@tin Synchronized wmf-config/Wikibase.php: revert (duration: 00m 17s) [15:39:06] aude: random is working now [15:39:06] thanks [15:39:06] bd808: yea thats where i tested, but i'm trying to add the count of errors to fatalmonitor on flourine. 
I have the query + jq command worked out just have to figure out how to run it now [15:39:06] bd808: i'll try the public endpoint, thanks [15:39:06] it needs http basic auth :? [15:39:06] totally my fault, but thanks [15:39:06] bd808: oh the frnotend would, yea [15:39:06] ok, I'm going to do my core/Echo patches now [15:39:06] bd808: maybe i'll just add a fermi exception [15:39:06] The cluster was open to $INTERNAL until this week but we closed it down to protect sensitive log data [15:39:06] (03CR) 10Eevans: [C: 031] cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) (owner: 10Filippo Giunchedi) [15:39:07] bd808: fluorine is rather limited (Deployers, plus a group that can read logs but not run anything), but if thats not limited enough i can figure something else out [15:39:07] since the logs are already there, imo its not a big deal, but willing to work around [15:39:08] ebernhardson: *nod* I think csteipp was mostly concerned about any internal host having access [15:39:08] !log legoktm@tin Synchronized php-1.27.0-wmf.4/includes/: LinksUpdate: Keep track of the triggering User - https://gerrit.wikimedia.org/r/#/c/249350/ (duration: 00m 22s) [15:39:08] if it is a reasonably restricted host he will probably be ok with it [15:39:08] bd808: ok thanks ill check [15:39:08] !log restarting cassandra on restbase1007 with 20G heap [15:39:08] ebernhardson: I actually have a toy written in go just for querying the ELK cluster. [15:39:08] https://github.com/bd808/ggml [15:39:09] !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/Echo/: https://gerrit.wikimedia.org/r/#/c/249351/ and https://gerrit.wikimedia.org/r/#/c/249411/ (duration: 00m 18s) [15:39:09] ebernhardson: ready for your second set of patches? 
[15:39:10] legoktm: should be ready by now yes, thakns [15:39:11] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761845 (10fgiunchedi) 5Resolved>3Open this isn't resolved as mobileapps still seems slow, related {T116770} is resolved though [15:39:11] 6operations, 10RESTBase, 6Services, 7RESTBase-architecture: Update restbase100[1-6] to the 3.19 kernel - https://phabricator.wikimedia.org/T102234#1761851 (10MoritzMuehlenhoff) 5Open>3Resolved restbase100[1-6] have been updated to the 3.19 kernel. [15:39:11] 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1761853 (10MoritzMuehlenhoff) [15:39:12] !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249317 (duration: 00m 17s) [15:39:13] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 918684 bytes in 2.966 second response time [15:39:33] (03PS12) 10Alex Monk: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:40:04] (03CR) 10Alex Monk: [C: 031] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:40:19] (03PS13) 10Alex Monk: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:40:28] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.048 second response time [15:40:31] 6operations, 10Beta-Cluster-Infrastructure, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1762095 (10hashar) 5Open>3Resolved 
{T103660} has finally been solved. That was the last Varnish cache still using Trusty. [15:40:40] !log legoktm@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249316 (duration: 00m 18s) [15:40:43] ebernhardson: all done ^^ [15:40:57] bblack: for information all Varnishes on beta cluster are now using Jessie :-} [15:41:12] (03CR) 10coren: [C: 031] "This needs merging now so that some ancilliary labstore services move to 1001 (backups, mostly)" [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren) [15:41:30] (03CR) 10Muehlenhoff: "PS13 includes base::firewall twice?" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:41:37] legoktm: ok checking [15:41:47] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:42:01] YuviPanda: andrewbogott: When either of you have a minute, https://gerrit.wikimedia.org/r/#/c/249408/ [15:42:32] legoktm: everything looks reasonable [15:42:32] (03CR) 10Muehlenhoff: [C: 04-1] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [15:42:34] legoktm: thanks [15:42:47] awesome [15:42:48] (03PS2) 10Andrew Bogott: Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren) [15:42:51] SWAT is done! [15:42:57] Coren: I’ll merge as soon as jenkins catches up [15:43:46] And in today's good news - even the unplanned issue didn't detract from the succesful switch and instances recovered wondering why they have a 1h blackout. 
:-) [15:44:01] (03CR) 10Andrew Bogott: [C: 032] Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren) [15:45:45] morebots, how’s it going? [15:45:45] I am a logbot running on tools-exec-1203. [15:45:45] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:45:45] To log a message, type !log <msg>. [15:46:04] !log testing the bot after the nfs move [15:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:44] (03PS1) 10Chad: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 [15:50:10] mutante: hehe ^ :) [15:51:30] (03Abandoned) 10Chad: beta: Start using parsoid cache 04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247332 (owner: 10Chad) [15:52:04] (03PS9) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [15:52:11] (03CR) 10BryanDavis: [C: 031] "Ori made this for testing HHVM restarts during scap. We tested, found out it was a horrible idea since we couldn't properly signal pybal t" [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad) [15:52:39] (03PS1) 10Aude: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 [15:54:33] (03CR) 10Andrew Bogott: "This is great! I'm going to add a few more clarifying comments, while we're at it..." [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn) [15:55:07] moritzm: any insights on this burrow package?
[15:56:49] (03PS2) 10Aude: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 [15:57:12] (03CR) 10Aude: "have manually tested this locally and it works properly, regardless of how $wgMFUseWikibaseDescription is set (and we are going to enable " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude) [15:58:10] greg-g: legoktm i'd like to try deploying https://gerrit.wikimedia.org/r/#/c/249431/ [15:58:36] more carefully locally tested it that it works ok [15:58:41] as intended :) [15:59:01] aude: I think greg is out today :( [15:59:05] ok [15:59:10] do you want me to deploy it or were you? [15:59:17] if you want to , or i can [15:59:33] uhh, could you? :) [15:59:34] would be nice to have this stuff live on test.wikidata a bit before we enable onw ikidata [15:59:40] sure :) [15:59:42] might be a good idea to try it on mw1017 first [16:00:19] i think it's ok, but sure :) [16:00:19] * aude stuck the config inside an existing if/else block for testwikidata [16:00:48] * aude tries on mw1017 [16:01:06] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1762160 (10coren) [16:01:10] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1762157 (10coren) 5Open>3Resolved This was confirmed in trial by fire with the switch of labs NFS back to labstore1001. [16:03:49] 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1762168 (10coren) 5Open>3Resolved a:3coren Resolved by the switchover test to end all switchover tests: labstore1001 is now back to being the pri... 
[16:07:45] (03CR) 10Aude: [C: 032] Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude) [16:07:52] (03Merged) 10jenkins-bot: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude) [16:11:21] legoktm, aude: Yeah, greg's out today, pulled his shoulder. [16:13:18] :( [16:13:33] i think greg knows what we are doing generally and it's ok :) [16:13:50] * aude will be enabling more geodata stuff on wikidata tomorrow and will put on the deployment calendar [16:15:00] all good on mw1017 :) [16:15:52] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: enable wikibase descriptions on test.wikidata (duration: 00m 17s) [16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:14] (03CR) 10Eevans: [C: 031] cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) (owner: 10Filippo Giunchedi) [16:18:18] !log aude@tin Synchronized wmf-config/Wikibase.php: add settings for displaying labels on test.wikidata in mobile (duration: 00m 18s) [16:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:44] wikidata is ok :) [16:19:32] done [16:21:13] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:22:32] 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 @ 0900 PST - https://phabricator.wikimedia.org/T116928#1762244 (10RobH) 3NEW a:3faidon [16:22:49] (03CR) 10Andrew Bogott: "I have a new version of this patch prepared but can't submit for network reasons... will upload in a couple of hours." 
[puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn) [16:23:14] 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 @ 0900 PST - https://phabricator.wikimedia.org/T116928#1762244 (10RobH) @Faidon, Please advise if this is not an appropriate time to do this. Previous discussion (linked off T116924) denotes that 'anytime' works. [16:25:53] 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1762269 (10RobH) [16:26:18] (03PS1) 10Ottomata: Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) [16:28:12] (03CR) 10Muehlenhoff: Open up port 8000 for burrow (on krypton) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata) [16:28:33] PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1: [16:28:54] (03PS2) 10Ottomata: Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) [16:29:27] (03CR) 10Muehlenhoff: [C: 031] Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata) [16:30:39] (03CR) 10Ottomata: [C: 032 V: 032] Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata) [16:31:21] (03PS2) 10Filippo Giunchedi: cassandra: fix HeapDumpPath and ErrorFile settings [puppet] 
- 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) [16:31:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) (owner: 10Filippo Giunchedi) [16:36:35] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1762316 (10ellery) I have never used data on landing page impressions (I only use banner impressions and clicks). [16:37:22] moritzm: yay @ ferm on tin [16:37:26] (03PS2) 10Ottomata: reqstats: test on cp1065 (eqiad text) as well [puppet] - 10https://gerrit.wikimedia.org/r/249237 (owner: 10BBlack) [16:38:22] (03PS2) 10Dzahn: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad) [16:38:51] mutante: indeed, if you want to you can merge the unification patch for tin/mira later [16:39:03] moritzm: but not the changed to unify mira and tin yet...looking why [16:39:08] change [16:39:14] yes, that:) ok [16:40:21] the isolated enable fix was easier to revert in case of problems [16:40:37] makes sense. yep. 
i'm amending and doing that now [16:40:50] (03CR) 10Ottomata: [C: 032] reqstats: test on cp1065 (eqiad text) as well [puppet] - 10https://gerrit.wikimedia.org/r/249237 (owner: 10BBlack) [16:41:12] (03PS2) 10Filippo Giunchedi: cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) [16:41:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) (owner: 10Filippo Giunchedi) [16:41:49] (03PS14) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [16:41:52] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/ [16:42:47] (03PS15) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [16:43:13] hmm, bblack [16:43:13] Oct 28 16:41:47 cp1065 systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:3] Failed to add dependency on varnish-frontend, ignoring: Invalid argument [16:43:13] Oct 28 16:41:47 cp1065 systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:4] Failed to add dependency on varnish-frontend, ignoring: Invalid argument [16:44:38] moritzm: i checked if mira and tin have the same number of iptables lines.. tin has a few more. one seems to be LOGGING. 
i assume that was to check all is ok [16:44:52] (03CR) 10Andrew Bogott: [C: 031] "works for me, but I would like Chase to review" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [16:45:54] mutante: I dropped some logging rules during the morning swat [16:46:27] (03CR) 10Dzahn: [C: 032] "tin has firewalling now:) so they are both identical" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man) [16:46:56] moritzm: yep, the diff is only the logging. alright [16:47:31] @seen hoo [16:47:31] mutante: Last time I saw hoo they were quitting the network with reason: Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ) N/A at 10/28/2015 1:14:05 PM (3h33m26s ago) [16:49:54] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1762373 (10Dzahn) tin got firewalling during this morning's swat deploy. that meant tin and mira are now identical and we could merge hoo's change above to reflect... [16:50:33] 6operations, 10Continuous-Integration-Config, 5Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1762381 (10Andrew) 5Open>3Resolved [16:50:47] moritzm: i fixed puppet compiler run for stuff on puppetmaster, so the palladium change +1 [16:51:00] (by adding fake new_install keys to labs/private) [16:51:03] ok, I'll have a look later or tomorrow [16:51:22] mira/tin look all fine [16:51:26] great [16:51:33] 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1762392 (10bd808) my $0.02: Fatalmonitor is hack of tail + sed + awk + watch, and only sees things that are reported to hhvm.log. Adding a curl + jq component to it that reports one number is not going to give... [16:51:34] We need to land the co-master thing tho. 
[16:52:04] https://gerrit.wikimedia.org/r/#/c/224829/ [16:52:43] ok, just looked at the pending blockers for that tracking ticket [16:52:55] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1762411 (10GWicke) I think this was caused by a bug in the mobile app end point monitoring spec. See https://github.com/wikimedia/restbase/pull/389 for a fix. [16:52:57] for codfw deployment server [16:53:13] "[scap] Add support for syncing /srv/mediawiki-staging including fully working git data to warm spare deploy server" [16:53:16] Yep :) [16:53:24] ok [16:53:29] The scap bits have already been merged. [16:53:33] nice [16:53:40] Just need the puppet config. [16:53:53] 6operations: Opendj on Neptunium running java 6, on Nembus java 7 - https://phabricator.wikimedia.org/T107424#1762423 (10Andrew) 5Open>3declined I think this is moot -- It works, and we're hoping to kill off opendj anyway. [16:54:06] !log installed openjdk security updates on zookeeper hosts [16:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:27] hm, i see a bit of sudo and wrapper script discussion there.. yea.. 
[16:54:48] !log unblacklist 'max' cassandra metrics and restart cassandra-metrics-collector [16:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:20] (03CR) 10Chad: "Although, the discussion now is about moving scap towards using the info directly from etcd (or having etcd write the dsh files) which wou" [puppet] - 10https://gerrit.wikimedia.org/r/247324 (owner: 10Chad) [16:55:29] (03PS3) 10Dzahn: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad) [16:55:44] (03CR) 10Dzahn: [C: 032] dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad) [16:58:08] jouncebot: next parsoid [16:58:08] In 1 hour(s) and 1 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T1800) [16:58:34] how about parsoid. for https://gerrit.wikimedia.org/r/#/c/249321/ [16:59:19] (03CR) 10Rush: [C: 04-1] contint: scandium configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:00:34] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] [17:01:36] (03CR) 10Hashar: contint: scandium configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:01:38] twentyafterfour: Any objection to me updating production scap to master before the train? [17:01:45] (it's already running master in beta) [17:02:04] chasemp: I am not sure how to set the $cluster and $nagois_contact_group via hiera:-/ [17:02:56] hashar: contact_group: admins,parsoid [17:03:02] for example [17:03:10] ostriches: I don't see any problem with it [17:03:11] going to update gallium :) [17:03:14] or hosts/gadolinium.yaml:contactgroups: 'admins,analytics' [17:03:26] eh, wait [17:03:37] contact_group vs. 
contactgroups [17:03:45] JohnFLewis: ^ :p [17:04:09] we have both, but i can confirm "contactgroups" works [17:04:13] like here: [17:04:16] mutante: contactgroups: [17:04:19] role/common/lists.yaml:contactgroups: admins,mailman-admins [17:04:23] hashar: ^ that [17:04:26] via role [17:04:31] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762460 (10Mooeypoo) I couldn't reproduce the specific problems in this bug, but it also shows several issues with Echo and m... [17:04:32] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] Filippo Giunchedi expected, cassandra metric unblacklisted [17:04:36] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1762462 (10GWicke) [17:04:41] it is contactgroups: the underscore felt, eh to me :) [17:04:43] (03PS1) 10Hashar: gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 [17:04:47] JohnFLewis: let's fix common/lvs/configuration.yaml: contact_group: admins,parsoid [17:05:08] unless that is not the same [17:05:34] (03CR) 10John F. 
Lewis: [C: 04-1] "contactgroups: not contact_group:" [puppet] - 10https://gerrit.wikimedia.org/r/249441 (owner: 10Hashar) [17:05:46] hashar: and $cluster is just "cluster: foo" [17:05:47] mutante: should be fixed yeah [17:05:56] ok ok :) [17:06:10] hashar: in general for the functions scandium is taking over I would like to define them into role and use the role keyword [17:06:33] chasemp: yeah the idea is to use role zuul::merger [17:06:48] but I don't want to get zuul merger enabled there before I get shell access to verify how the network flows behave [17:06:57] ok I understand then thanks [17:07:03] sorry should have made it clear [17:07:17] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762473 (10Mooeypoo) More to the point, I don't think this should be "blocked". Some aspects of this ticket are under develop... [17:07:21] (03PS2) 10Hashar: gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 [17:09:15] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762476 (10Mattflaschen) Yes, my understanding is we took it because although we thought the fix would touch mobile code (and... [17:10:33] (03PS4) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) [17:12:33] bleh, stuck at 15 hosts not fetching. 
[17:12:40] * ostriches stabs trebuchet a bit [17:12:46] (03CR) 10Hashar: "The $cluster and nagios contact group are now in hiera similar to how I did it for gallium: https://gerrit.wikimedia.org/r/249441" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:13:39] tin and mira among the ones that won't fetch, which are 2 I want, heh [17:14:40] ostriches: will you be able to baby sit the patch to stop gerrit replication to gallium ? https://gerrit.wikimedia.org/r/#/c/244498/ I asked a bit today but without luck :-} [17:14:48] Yeah [17:15:01] ostriches: I don't mind handling the cleanup on gallium [17:15:36] mutante: andrewbogott are we in agreement that because https://gerrit.wikimedia.org/r/#/c/249380/4 is access granted for teh same service on a new box it dosn't require any kind of sudo perms approval? this is lateral migration not escalation [17:15:38] <_joe_> ostriches: let me know if you need help from me for working towards making mira able to deploy [17:16:31] chasemp: yeah, I don’t think it counts as new access. [17:16:46] (03PS6) 10Rush: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [17:16:52] !log deployed scap master@abe1973 to cluster [17:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:59] chasemp: mutante: andrewbogott: I could use temp root on scandium to do the service implementation. It might not be needed but having it is a nice convenience. I filled a task to remember to remove the root access [17:17:11] _joe_: It's just that one last patch for puppet :) [17:17:16] (03CR) 10Rush: "what's the story with this patch? We are publishing these updates now right so we need to roll on this to be more safe(-ish)?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [17:17:41] hashar: you have root on the current zuul merger right? [17:17:50] ostriches: and https://gerrit.wikimedia.org/r/#/c/247965/ [17:17:54] chasemp: I got root on gallium which host zuul server / zuul merger and jenkins [17:18:05] bd808: Oh yeah I was gonna merge that. [17:18:09] chasemp: the only difference is scandium is in the labs network [17:18:20] hashar: right ok yeah I'm going to roll on it then, de-escalation is another process entirely [17:18:24] \O/ [17:18:26] well it's in the support vlan which we said we are ok w/ [17:18:38] <_joe_> ostriches: throw that patch to me [17:18:39] <_joe_> :P [17:18:41] chasemp: do we have any network map of labs / prod etc? [17:18:43] we talked about this specific instance at the offsite yw! :) [17:18:50] _joe_: https://gerrit.wikimedia.org/r/#/c/224829/ [17:18:53] * hashar loves offsites [17:18:58] bd808: merged [17:18:59] good question, I think andrewbogott made something for labs not sure where it is [17:19:40] hashar: this isn’t quite what you asked for, but it’s what we have: https://wikitech.wikimedia.org/wiki/Labs_infrastructure [17:20:14] (03CR) 10Rush: [C: 032] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:20:19] andrewbogott: nonetheless very interesting, thanks! [17:20:26] <_joe_> ostriches: uhm that patch needs some work, but I don't have time to review it properly now [17:20:37] (03CR) 10Rush: [V: 032] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:20:44] _joe_: What needs work? 
[17:21:38] (03CR) 10Dzahn: [C: 031] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [17:22:15] <_joe_> ostriches: permissions are granted too broadly, there is no reason not to add that permission to the scap masters only [17:22:24] <_joe_> ostriches: I can fix that pretty easily [17:22:28] mutante: so 'us Submitted, Merge Pending' any idea why? [17:22:42] chasemp: yea, i'd agree that migration to a new host with the identical group and people doesn't need it. we should base the access on a role, then this would not be a question [17:22:53] chasemp: dependency on a nother change? [17:23:08] !log deployed scap master@f823129 to cluster [17:23:09] yea, it needs https://gerrit.wikimedia.org/r/#/c/249441/2 first [17:23:11] _joe_: you are of course right, it's only needed on the masters [17:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:31] bd808: Ok, master including both changes is now in prod. [17:23:44] fancy! [17:23:49] (03CR) 10Rush: [C: 032] gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 (owner: 10Hashar) [17:24:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Looks good in general, but I'd like to have these sudo rights to be granted on the masters only. 
I'm unsure how we should do it though, wi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [17:25:35] 6operations, 10ops-eqiad: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1762540 (10hashar) 3NEW [17:25:44] 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1762547 (10hashar) [17:26:10] hashar: there is a small diff between gallium and scandium: [17:26:19] < ssh::server::disable_nist_kex: false [17:26:20] < ssh::server::explicit_macs: false [17:26:26] chasemp: I got access to scandium thanks! [17:26:33] but that was just needed on gallium because different distro version, right [17:26:52] mutante: yeah that is to disable the new ssh keys on gallium because Jenkins doesn't know those new algorithms [17:27:01] saying that because without that difference we could move the hiera stuff to role/common right away and forget about hostnames [17:27:05] incl. the admin group [17:27:21] mutante: I haven 't checked but zuul-merger that is going to be on scandium should be fine. Else will have to apply the same settings to scandium unfortunately :-( [17:27:30] ok [17:28:01] (03PS10) 10BryanDavis: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) [17:28:09] _joe_: ^ [17:28:27] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1762568 (10hashar) We got shell access thanks to ops reviews! Will now look at the network flows. Once happy we can apply the zuul::merger role and d... [17:28:34] totally untested but ostriches can try it out in beta cluster [17:28:45] mutante: andrewbogott: chasemp: JohnFLewis: thank you for the reviews! 
works for me and that is the end of the day :-} [17:29:02] (03PS3) 10Dzahn: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar) [17:29:08] hashar: np, here's one more change then [17:29:15] (03CR) 10Dzahn: [C: 032] contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar) [17:29:22] hashar: welcome ( I guess :P) [17:31:07] <_joe_> bd808: I'll take a look after the SoS [17:31:26] I'll test it on beta in the meantime [17:31:30] (03CR) 10Hashar: [C: 04-1] "Need rebase and have nodejs-legacy included (from https://gerrit.wikimedia.org/r/249391 )" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [17:31:50] mutante: gallium is happy. ty! [17:32:09] hashar: ok,cool. good night! [17:33:52] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [17:40:51] (03CR) 10EBernhardson: "we should send this out, it just slipped my mind. will put it in the afternoon swat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [17:41:21] (03CR) 10EBernhardson: "also we are still only sending writes to eqiad and codfw, so this wont have an immediate effect, the labs writes are only turned on for te" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [17:43:15] :q [17:56:07] bd808, _joe_: Tested beta with new sudo rules, co-master sync worked just fine [17:56:48] <_joe_> ostriches: nice, but I need a pause after the SoS :P [17:57:05] No worries, I'm gonna grab my 2nd coffee. 
[17:58:27] <_joe_> after all, it's only 11 hours that I'm around :P [17:58:35] (03PS4) 10MaxSem: Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) [17:58:48] (03PS1) 10Aaron Schulz: Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 [17:58:59] legoktm: ^ [17:59:29] AaronSchulz: will that work for the wikis that are still on wmf.3? [17:59:31] !log rolling-restart cassandra-metrics-collector, staggered this time [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T1800). Please do the needful. [18:00:13] legoktm: the spec change was older, but let me check if it's in 3 [18:02:53] twentyafterfour, can I get your review on https://gerrit.wikimedia.org/r/248526 please? 
[18:04:13] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail [18:04:57] (03CR) 10Mobrovac: "LGTM (won't +1 for now on account of the pending deploy)" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry) [18:06:23] legoktm: looks like that didn't make it to wmf3 [18:08:32] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:08:47] (03CR) 10Aaron Schulz: [C: 04-2] "Blocked on needing wmf4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz) [18:09:29] AaronSchulz: if we just want to stop the warnings, we could set 'timeout' in the array definition instead of the global [18:12:28] legoktm: I'll update mc.php [18:12:54] !log mwscript deleteEqualMessages.php --wiki fawiki [18:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:46] !log mwscript deleteEqualMessages.php --wiki hiwiki [18:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:09] (03PS1) 10Aaron Schulz: Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 [18:14:37] akosiaris, hi, how do i deploy tileratorui? [18:14:43] i don't see it on itn [18:14:44] tin [18:15:47] bd808: It still won't set the mtime on ./ [18:15:53] Everything else succeeds. [18:16:17] permissions are /identical/ on both dirs. 
[18:16:24] *grumble* [18:16:41] legoktm: ^ [18:17:18] (03CR) 10Legoktm: [C: 031] Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz) [18:18:10] !log canary deploy of restbase deploy 3b1f6488f2 to restbase1001 [18:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:29] ostriches: https://serverfault.com/questions/337766/how-to-allow-multiple-people-to-change-mtime-timestamp-of-a-file-through-sftp/337810#337810 [18:19:03] Apparently only the owner can set mtime [18:19:27] (03PS1) 10Mobrovac: RESTBase: Strip redundant headers from back-end services [puppet] - 10https://gerrit.wikimedia.org/r/249465 (https://phabricator.wikimedia.org/T116911) [18:19:28] 6operations, 6Commons, 10Wikimedia-Media-storage, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1762737 (10matmarex) I wouldn't expect the files back at all. Nobody who could do it seems to have time to investigate whether they still exist somewh... [18:19:45] (03PS1) 10coren: Labs: reenable ldap on labstore* [puppet] - 10https://gerrit.wikimedia.org/r/249466 (https://phabricator.wikimedia.org/T116927) [18:20:20] paravoid: ^^ [18:20:56] bd808: why hasn't this come up before? We've never assumed deployer was owner [18:20:57] bd808, do you know by any chance how production is set up tileratorui? It shares the same git repo as tilerator service, but gets deployed as a separate service (different configuration file). I can't seem to find it on tin [18:21:33] yurik: I've never heard of it before, sorry [18:21:49] * bd808 stays aways from the nodejs world mostly [18:22:02] bd808, who sets up the tin deployment dirs? 
[18:22:09] puppet [18:22:25] And whomever is deploying said service [18:22:39] (03CR) 10Filippo Giunchedi: scap: Add co-master configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:22:42] ostriches, you mean i should manually set up that dir? [18:22:42] package with provider == trebuchet makes the initial clone [18:22:57] i guess i have to wait for akosiaris to sort it out ^ [18:23:19] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, modulo what _joe_ said re: masters" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:23:26] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1762753 (10jcrespo) 5Resolved>3Open We have to drop the views on labs, too. [18:23:39] !log rolling deploy of restbase-deploy 3b1f6488f2 to restbase cluster [18:23:42] No puppet will set it up, I'm just saying the person who sets it up is puppet + whoever is deploying the service there. They'd be the person to ask :) [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:35] (03CR) 10BryanDavis: "The sudoer rules are only applied on masters starting in PS10" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:25:31] legoktm: want to deploy that change? [18:25:48] or I could go to my other laptop that has my ssh key [18:27:25] you need a bumper sticker [18:27:30] "my other laptop has an ssh key" [18:28:01] heh [18:28:33] "my other ssh key has 5-factor authentication" [18:28:57] 6operations, 6Commons, 10Wikimedia-Media-storage, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1762759 (10aaron) It would either be at the old or the new name. 
If it's at either then it can be accessed via the right URL (the tricky part is the /... [18:29:03] bblack: One of my factors involves my dog [18:29:05] (03PS1) 10Yuvipanda: admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 [18:29:07] Heh. Google once had "my other machine is a datacenter" stickers. [18:30:32] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:31:16] AaronSchulz: yeah I can do it [18:32:06] twentyafterfour: have you started the train yet? [18:32:52] (03CR) 10Dzahn: "let me add Moritz here first. we are in the process of switching to debdeploy for upgrades and i think this will conflict with alternate m" [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [18:33:10] legoktm, git deploying maps, almost done [18:33:59] !log updated kartotherian & tilerator services [18:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:10] ok [18:34:15] legoktm, done [18:34:24] I'll assume twentyafterfour hasn't started yet [18:34:34] (03CR) 10Legoktm: [C: 032] Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz) [18:34:50] !log restbase deploy done [18:34:53] RECOVERY - Host labstore1002 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [18:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:56] (03Merged) 10jenkins-bot: Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz) [18:36:08] IOError: [Errno 2] No such file or directory: 'scap-masters' [18:36:08] 18:36:01 sync-file failed: [Errno 2] No such file or directory: 'scap-masters' [18:36:12] ostriches: ^ ? [18:36:54] Blech. 
[18:37:53] (03CR) 10Filippo Giunchedi: [C: 031] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:38:30] ostriches: is that a quick fix or should I revert the undeployed change? [18:38:33] (03CR) 10Faidon Liambotis: [C: 032] Labs: reenable ldap on labstore* [puppet] - 10https://gerrit.wikimedia.org/r/249466 (https://phabricator.wikimedia.org/T116927) (owner: 10coren) [18:38:53] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is inactive [18:39:06] legoktm: Quick rollback, yes. But trebuchet stole ownership of some files and I can't! [18:39:14] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive [18:39:18] hmm [18:39:30] I assume those checks will go away once puppet runs [18:39:32] on labstore1002 [18:39:35] since it's not the active host [18:39:45] I'll look at puppet to make sure [18:39:47] Yep. Been switched in hiera. [18:39:48] ostriches: uh, so we need a root? [18:40:07] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1762788 (10jcrespo) Involving @Coren because of my last comment (this time, no pressure). [18:40:19] legoktm: Worked around it. [18:40:24] Yay git-fu! [18:40:39] ok, should I try syncing now? [18:40:49] 1 sec.... 
[18:41:17] should be good [18:41:44] !log rolled scap back to master@62a250a, needs puppet changes before new code goes live [18:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:57] (03PS3) 10Andrew Bogott: openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn) [18:42:14] * legoktm tries again [18:42:29] !log legoktm@tin Synchronized wmf-config/mc.php: https://gerrit.wikimedia.org/r/#/c/249463/ (duration: 00m 17s) [18:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:37] ostriches: thanks [18:42:40] AaronSchulz: log spam stopped [18:43:16] (03CR) 10Andrew Bogott: [C: 031] openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn) [18:47:42] legoktm: thanks [18:50:33] !log mwscript deleteEqualMessages.php --wiki itwikiversity [18:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:23] (03PS1) 10Cmjohnson: Updating dhcp address for ms-be109 [puppet] - 10https://gerrit.wikimedia.org/r/249474 [18:53:58] (03CR) 10Cmjohnson: [C: 032] Updating dhcp address for ms-be109 [puppet] - 10https://gerrit.wikimedia.org/r/249474 (owner: 10Cmjohnson) [18:56:14] 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1762833 (10chasemp) 5Open>3declined a:3chasemp >>! In T90407#1586541, @revi wrote: >>>! In T90407#1555920, @Dzahn wrote: >> Aren't all the public archives on gmane.org anyways an... 
[18:58:16] (03PS4) 10Dzahn: beta: point parsoid back to source code [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [18:58:33] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 24 failures [18:58:43] hmm [18:58:53] so now codfw labstores also have LDAP [18:59:01] which might or might not have been expected behavior [18:59:14] * YuviPanda goes to look [18:59:45] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1762849 (10chasemp) 5Open>3declined There aren''t resources or clear list participant interest across the scope of lists for this. We also don't have the manpower to keep... [18:59:48] (03CR) 10Dzahn: [C: 032] "only touches the beta class (yes, separate from prod here as well) and per hashar already "cherry picked on integration puppetmaster"" [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [19:00:17] YuviPanda: that seems to be what Coren said in -labs [19:00:46] mutante: I think it might've been unintentional since it also has admin [19:02:47] YuviPanda: duh [19:02:58] yeah we can put that include under an if $::site guard [19:03:05] and undo the changes manually [19:03:07] yeah [19:03:12] after this dies down I guess [19:04:11] legoktm: no train yet, ready to go now if nothing is blocking it [19:04:19] twentyafterfour: yeah, I finished [19:04:21] MaxSem: review coming up [19:06:18] there was like a huge spike of db errors on testwikidatawiki at 15:20-15:25, is this known? 
[19:06:28] aude, ^ [19:06:48] should be, aude deployed then undeployed a patch in swat this morning [19:07:12] ok, it is a test wiki because of that [19:07:42] so all good [19:09:07] I see it on SAL, yes [19:09:42] (03PS1) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 [19:10:18] (03PS5) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [19:10:25] (03CR) 10jenkins-bot: [V: 04-1] cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 (owner: 10BBlack) [19:11:02] bd808: thcipriani new 'deployment started' gif? https://shogofawafa.files.wordpress.com/2015/08/tumblr_npnpkttr9b1tqtfrjo1_500.gif?w=700&h=438 [19:11:21] no cats [19:11:31] we can make that a pig [19:11:46] that would be acceptable :) [19:11:47] ^ seems like a good compromise. [19:11:50] (03CR) 10Ottomata: [C: 032] Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal) [19:12:10] (03PS6) 10Ottomata: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) (owner: 10Milimetric) [19:12:17] (03PS2) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 [19:12:19] (03CR) 10Ottomata: [C: 032 V: 032] Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) (owner: 10Milimetric) [19:12:42] (03PS2) 10Ottomata: Alert about the status of pageview and projectview [puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric) [19:12:53] (03CR) 10jenkins-bot: [V: 04-1] cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 (owner: 10BBlack) 
[19:12:55] (03CR) 10Ottomata: [C: 032 V: 032] Alert about the status of pageview and projectview [puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric) [19:14:09] 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762943 (10chasemp) p:5Triage>3Normal [19:14:49] 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (10chasemp) is this the actual access request or an outline for greater work? [19:17:05] (03PS1) 10Ottomata: Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) [19:18:22] (03PS1) 10Jcrespo: Enable performance_schema on db1065 [puppet] - 10https://gerrit.wikimedia.org/r/249480 [19:18:51] 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1762964 (10chasemp) So this only applies to: Researchers https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/data/data.yaml;cfa2ef23a4429ab1... [19:19:32] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1762965 (10jcrespo) [19:19:51] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures [19:20:14] (03PS2) 10Ottomata: Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) [19:20:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1762978 (10chasemp) https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/data/data.yaml;cfa2ef23a4429ab12df33b0cd0... 
[19:20:50] (03PS5) 1020after4: Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [19:20:51] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:55] (03PS1) 10Andrew Bogott: Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) [19:21:09] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:21:15] (03CR) 10Ottomata: [C: 032] Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [19:21:31] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1762981 (10chasemp) >>! In T116487#1754582, @chasemp wrote: >>>! In T116487#1754144, @greg wrote: >> Jan: can you put in the description why you are requesting access? :) > > Th... [19:21:51] (03PS2) 10Andrew Bogott: Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) [19:21:55] (03CR) 10Jcrespo: "This can be deployed at any time, as it requires mysql reboot to take effect, but lest wait for a better window for me." [puppet] - 10https://gerrit.wikimedia.org/r/249480 (owner: 10Jcrespo) [19:22:01] (03CR) 1020after4: [C: 032] "Straightforward enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [19:22:19] PROBLEM - RAID on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:22:31] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1762982 (10chasemp) [19:22:39] PROBLEM - Check size of conntrack table on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:40] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:59] PROBLEM - configured eth on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:22:59] PROBLEM - DPKG on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:10] (03CR) 10Andrew Bogott: [C: 032] Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) (owner: 10Andrew Bogott) [19:23:10] PROBLEM - dhclient process on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:14] 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762984 (10Dzahn) it's the actual request, it's a follow-up to T115718 . that can be considered the parent task while this here is an addition that is needed and should have... [19:23:20] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473834 (10chasemp) There is no attached actionable access to grant here. It seems we are in a discussion phase. I am removing the tag for th... [19:23:20] PROBLEM - Hadoop DataNode on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:30] PROBLEM - puppet last run on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:23:34] hm [19:23:39] PROBLEM - SSH on analytics1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:39] 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762988 (10Dzahn) [19:23:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn) [19:23:40] PROBLEM - salt-minion processes on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:40] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:42] 6operations, 7Easy, 7HTTPS: WMF-Last-Access cookies doesn't set Secure flag - https://phabricator.wikimedia.org/T105451#1762990 (10chasemp) p:5Triage>3Normal [19:23:51] PROBLEM - Disk space on Hadoop worker on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:51] 6operations, 10Traffic, 7Easy, 7HTTPS: WMF-Last-Access cookies doesn't set Secure flag - https://phabricator.wikimedia.org/T105451#1444142 (10chasemp) [19:24:15] woo wee cluster is busy backfilling [19:24:27] joal: look at that busy hadoop [19:24:27] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false [19:24:40] 6operations, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1762997 (10chasemp) p:5Triage>3Normal [19:24:49] PROBLEM - DPKG on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:24:49] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING [19:24:49] PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:24:50] PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:24:58] 6operations, 10Deployment-Systems: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763007 (10chasemp) p:5Triage>3Normal [19:25:02] That's what we want :) machines to do stuff ottomata :) [19:25:13] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1759278 (10chasemp) [19:25:31] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1759278 (10chasemp) at #release-Engineering-Team please advise :) [19:25:47] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1763014 (10chasemp) p:5Triage>3Normal [19:25:57] ottomata: anything you want me to stop ? 
[19:25:57] 6operations, 10Traffic, 5Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763016 (10chasemp) p:5Triage>3Normal [19:26:03] joal: naw, its fine [19:26:08] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1763020 (10chasemp) p:5Triage>3Normal [19:26:17] we just get some (obviously) false alarms because icinga/nrpe timesout the checks [19:26:17] ok cool :) [19:26:19] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [19:26:19] when the cluster is busy [19:26:20] 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1763022 (10chasemp) p:5Triage>3Normal [19:26:21] RECOVERY - DPKG on analytics1032 is OK: All packages OK [19:26:29] RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up [19:26:29] RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full [19:26:37] 6operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742#1763024 (10chasemp) p:5Triage>3Normal [19:26:50] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown, 10Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1763027 (10chasemp) p:5Triage>3Normal [19:26:50] PROBLEM - Disk space on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:27:45] 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763049 (10chasemp) 5Open>3stalled Let's wait on this until we have a fully realized and migrated solution here just in case so we don't end up in a "oh wait... 
[19:27:51] 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763052 (10chasemp) p:5Triage>3Lowest [19:29:38] (03CR) 10Rush: "thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [19:31:08] (03PS1) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) [19:31:21] (03PS2) 10Yuvipanda: admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 [19:31:23] (03PS1) 10Yuvipanda: labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 [19:32:00] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:49] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [19:34:09] (03PS2) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) [19:35:03] 10Ops-Access-Requests, 6operations: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1763098 (10Dzahn) [19:35:33] 10Ops-Access-Requests, 6operations: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn) continued here: T116884: let datacenter-ops sign puppet certs and accept salt keys [19:36:39] Undefined index: timeout in /srv/mediawiki/php-1.27.0-wmf.4/includes/objectcache/MemcachedPeclBagOStuff.php on line 86 [19:36:55] anyone know what might have changed that would cause the objectcache to get initialized without the timeout arg? 
[19:38:33] (03CR) 10Papaul: [C: 031] admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn) [19:39:19] (03CR) 10Yuvipanda: [C: 032] admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 (owner: 10Yuvipanda) [19:39:30] RECOVERY - configured eth on analytics1038 is OK: OK - interfaces up [19:39:30] RECOVERY - DPKG on analytics1038 is OK: All packages OK [19:39:50] RECOVERY - Disk space on analytics1038 is OK: DISK OK [19:39:50] RECOVERY - dhclient process on analytics1038 is OK: PROCS OK: 0 processes with command name dhclient [19:40:11] RECOVERY - SSH on analytics1038 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:40:12] RECOVERY - salt-minion processes on analytics1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:40:12] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:40:30] RECOVERY - Disk space on Hadoop worker on analytics1038 is OK: DISK OK [19:41:09] RECOVERY - Check size of conntrack table on analytics1038 is OK: OK: nf_conntrack is 0 % full [19:41:10] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:41:50] RECOVERY - Hadoop DataNode on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:41:59] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures [19:42:30] RECOVERY - RAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical [19:43:01] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [19:45:39] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:48:29] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [19:49:28] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1763165 (10Yurik) @dzahn, I am very happy to get all the help I could get from Ops, but as our platform growth, each service will require ops a... [19:50:14] (03PS3) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 [19:51:13] ok train for group1 is about to go out [19:51:20] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 [19:51:31] (03PS1) 10John F. 
Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 [19:51:32] 6operations, 10Traffic, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1763167 (10faidon) [19:51:45] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 (owner: 1020after4) [19:51:51] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 (owner: 1020after4) [19:52:04] (03PS2) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 [19:52:07] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.4 [19:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:19] (03PS4) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 [19:54:41] (03PS3) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) [19:54:53] (03PS4) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) [19:56:30] 6operations, 10Traffic, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1763183 (10faidon) Confirmed that 2015-11-02 17:00 UTC works. I'll depool ulsfo a few hours earlier. Please confirm with me before the start of the window and beginning the... [20:00:01] (03PS5) 10John F. Lewis: mailman: reject subscriptions to disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. 
Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2000). [20:00:28] nothing to deploy today for parsoid [20:00:40] no mobileapps deploy today [20:03:09] (03CR) 10Rush: [C: 032] "seems sound" [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) (owner: 10John F. Lewis) [20:03:14] (03PS1) 10Merlijn van Deen: toollabs: use 'fastapt' provider for packages [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) [20:03:17] YuviPanda: ^ :-) [20:03:35] (will probably get a -1 from jenkins, but hey) [20:04:18] :D [20:05:04] valhallasw`cloud: we should probably not make it default for anything, and explicitly specify provider in both exec and dev environ to start with [20:05:25] Package { :provider => 'fastapt' } ? sure. [20:05:34] yeah [20:05:46] 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1763207 (10Dzahn) I would like to add this: I went to the upstream mailman IRC channel and asked about a feature that just let's list admins download the .mbox files of their own list... [20:05:54] oh, that also makes it easier to test on toolsbeta [20:05:56] ok, good idea [20:07:27] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763210 (10saper) As per https://lists.wikimedia.org/pipermail/wikitech-l/2015-October/083752.html I'd like to sort #mediawiki-installer bugs... 
[20:08:00] let me also add some docs [20:09:20] valhallasw`cloud: \o/ cool [20:09:23] (03PS1) 10Hashar: cache: vary statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) [20:10:49] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763228 (10chasemp) @saper sprint projects specifically being short lived and hopefully archived after a set period have been a bit of a grey... [20:10:51] (03CR) 10Hashar: "I am not sure how many metrics it is going to add to labmon1001.eqiad.wmnet :-\" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [20:11:05] 6operations: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951#1763230 (10MoritzMuehlenhoff) An annoying side aspect is that the files get still copied to the pool, but the Packages file isn't updated along. [20:11:09] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:11:39] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:12:51] 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763238 (10hashar) Good call. We never know :-) [20:13:11] (03PS1) 10coren: Labs: remove admin from labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/249493 [20:13:15] YuviPanda: ^^ [20:13:46] chasemp: paravoid ^ [20:13:55] also I think that'll need manual cleanup too [20:14:27] YuviPanda: I suggest testing on toolsbeta first [20:14:29] Aw, you're probably right. I'm pretty sure that there is no provision to => absent the passwd entries. 
[20:14:39] Coren: no [20:14:54] YuviPanda: Coren is this still the problem I proposed https://docs.puppetlabs.com/references/latest/type.html#user-attribute-forcelocal for previously? [20:14:56] it's more complicated than that [20:15:01] # FIXME: this is an intentional hard stop as before T84032 [20:15:02] if [[ `hostname -s` =~ ^labstore100 ]]; then exit 1 [20:15:02] fi [20:15:25] Ah. [20:15:27] are we ever realistically going to use 2001 as an NFS server to labs-eqiad? [20:15:39] I don't actually think so. [20:15:43] it's just our backup dest [20:15:46] right [20:15:48] should probably not even be using the same roles [20:15:54] or be called the same thing [20:15:54] we should strive not to tbh [20:15:55] so let's just remove all this ldap crap [20:16:00] yeah [20:16:02] paravoid: I don't believe we will. It's a backup target, and may become useful if we ever have labs-like in codfw [20:16:17] so that's what https://gerrit.wikimedia.org/r/#/c/249484/ does but I'm not sure what all needs to be manually removed [20:16:35] YuviPanda: I've pulled out a list of packages in -operations not long ago. [20:16:50] (03PS2) 10Merlijn van Deen: toollabs: install 'fastapt' provider for packages [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) [20:16:56] better check 'grep puppet /var/log/syslog' for what exactly it did and undo it [20:16:59] 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1763240 (10JanZerebecki) Yes. (It is a superset of the group statistics-users.) [20:16:59] (03Abandoned) 10coren: Labs: remove admin from labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/249493 (owner: 10coren) [20:17:45] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763243 (10hashar) The original commit is from Nov 30, 2012. 
I think at one point the idea was to publish the state of the repos on the de... [20:18:53] ah yeah I see diffs [20:18:56] so we can undo those [20:19:00] not sure what to do about the local accounts [20:19:10] the local accounts were there before [20:19:12] and should stay [20:19:16] there is a last run report in /var/lib/puppet [20:19:22] I tink [20:19:24] (03CR) 10coren: [C: 031] "Yep." [puppet] - 10https://gerrit.wikimedia.org/r/249484 (owner: 10Yuvipanda) [20:19:28] oh right ofc [20:20:05] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763254 (10chasemp) 5Open>3Resolved a:3chasemp [20:22:01] 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1763256 (10chasemp) [20:23:25] chasemp: yea but there's been more runs since then I guess [20:23:32] chasemp: I found diffs in /var/log/puppet.log [20:23:46] 6operations: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1763259 (10Dzahn) 3NEW a:3Dzahn [20:23:54] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763266 (10saper) In general I find it difficult to grasp the project with more than > 50 tasks on the board. (Maybe others can:) By grasping... [20:24:15] 6operations: build newer tor packages - https://phabricator.wikimedia.org/T116964#1763267 (10Dzahn) 3NEW a:3Dzahn [20:27:39] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:28:06] hashar: "end of the day" extended?:) [20:28:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[20:28:21] (03PS2) 10Yuvipanda: labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 [20:29:03] !log disable puppet on labstore1001 [20:29:06] mutante: yeah now it is the start of the evening :-} [20:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:30] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1763296 (10Krinkle) [20:29:32] (03CR) 10Yuvipanda: [C: 032] labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 (owner: 10Yuvipanda) [20:29:51] hashar: should we do https://gerrit.wikimedia.org/r/#/c/244498/ ? [20:30:02] because of the restart part etc [20:30:14] "reload of the replication plugin" [20:31:40] (03CR) 10Dereckson: "Currently, every key of this array are in CommonSettings.php, except one in proofreadpage.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [20:39:18] (03PS4) 10Dzahn: openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 [20:40:44] (03CR) 10Dzahn: [C: 032] "most of it are only comments, and the rest are harmless changes like whitespace and alignment. it does fix a whole bunch of warnings thoug" [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn) [20:46:11] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:43] (03CR) 10Krinkle: "Hm.. wouldn't the following work without the additional non-standard wmg logic?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [20:47:59] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [20:48:24] (03PS1) 10Yurik: Allow same perms to tileratorui as tilerator [puppet] - 10https://gerrit.wikimedia.org/r/249501 [20:55:48] !log reverted changes to nsswitch.conf from puppet run manually on labstore2001 [20:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:00] PROBLEM - Check size of conntrack table on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:11] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:41] jouncebot: next [20:57:41] In 2 hour(s) and 2 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2300) [20:59:19] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:02] PROBLEM - DPKG on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:02] PROBLEM - configured eth on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:21] PROBLEM - Disk space on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:21] PROBLEM - dhclient process on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:39] PROBLEM - Hadoop DataNode on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:50] PROBLEM - puppet last run on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:51] PROBLEM - salt-minion processes on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:01:51] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:59] PROBLEM - SSH on analytics1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:01] PROBLEM - Disk space on Hadoop worker on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:02:39] RECOVERY - Check size of conntrack table on analytics1038 is OK: OK: nf_conntrack is 0 % full [21:02:50] RECOVERY - configured eth on analytics1038 is OK: OK - interfaces up [21:02:50] RECOVERY - DPKG on analytics1038 is OK: All packages OK [21:03:09] RECOVERY - Disk space on analytics1038 is OK: DISK OK [21:03:10] RECOVERY - dhclient process on analytics1038 is OK: PROCS OK: 0 processes with command name dhclient [21:03:20] RECOVERY - Hadoop DataNode on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [21:03:30] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures [21:03:39] RECOVERY - salt-minion processes on analytics1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:03:40] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [21:06:39] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [21:07:59] PROBLEM - RAID on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:08:21] !log enable puppet on labstore1001 [21:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:20] RECOVERY - Disk space on Hadoop worker on analytics1038 is OK: DISK OK [21:09:40] RECOVERY - RAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical [21:10:11] 6operations, 6Labs: Cleanup / clarify labstore2001 - https://phabricator.wikimedia.org/T116972#1763469 (10yuvipanda) 3NEW [21:10:44] (03CR) 10JanZerebecki: [C: 031] admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: 10Dzahn) [21:11:01] RECOVERY - SSH on analytics1038 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [21:11:33] !log legoktm@tin Synchronized php-1.27.0-wmf.4/includes/changes/EnhancedChangesList.php: Fix diff/history links not showing up for ungrouped enhanced RC - https://gerrit.wikimedia.org/r/#/c/249556/ (duration: 00m 19s) [21:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:39] (03CR) 10Dzahn: "i don't know how we can achieve these goals at the same time: a) kill global ./files/ b) move everything into a module structure c) not" [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:16:25] (03Abandoned) 10Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [21:17:41] (03Abandoned) 10Dzahn: logstash: move files from ./files to module [puppet] - 10https://gerrit.wikimedia.org/r/249062 (owner: 10Dzahn) [21:20:23] (03CR) 10Dzahn: "ok, after reading the comments on the other similar change... 
move them all into the role module is the right answer i suppose" [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:23:48] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1763543 (10MeganHernandez_WMF) We have used landing page impressions in the past and I think we do use this for e... [21:30:03] 6operations, 10Traffic, 5Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763580 (10BBlack) Just status update on where this is at: maps is ready to take off using its own multicast address. Upload is listening to both its old (legacy shared) and new multicast fo... [21:33:20] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763585 (10BBlack) p:5Triage>3Normal [21:33:54] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763588 (10BBlack) a:5BBlack>3Yurik re-assigning to Yuri to actually implement the HTCP-sending part so we can observe whether it works [21:34:19] 6operations, 10Traffic, 5Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763596 (10BBlack) [21:34:20] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763595 (10BBlack) [21:34:23] (03CR) 10Tim Landscheidt: "This makes the setup more complicated for 50 seconds saved time. 
Puppet is run automatically in the background and its runtime not that i" [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: 10Merlijn van Deen) [21:34:38] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1763599 (10Nemo_bis) 5declined>3Open There is no need of ops manpower for this task. [21:34:42] Would anyone be able to tell me what the version of the vips binary is on the image scalars? [21:35:46] bawolff: which servers are those? [21:36:42] mw1153, mw1154 I think [21:37:21] mw1153 is role::mediawiki::imagescaler [21:37:30] legoktm@mw1153:~$ vips --version [21:37:30] vips-7.38.5-Sat Apr 5 11:17:49 UTC 2014 [21:38:31] Thanks [21:38:45] oddly enough, that's the same version I have, but yet everything works on my computer [21:38:55] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1763618 (10Ottomata) >> If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLo... [21:39:41] 6operations, 5Continuous-Integration-Scaling: Allow network flow between labs instance and scandium - https://phabricator.wikimedia.org/T116975#1763623 (10hashar) 3NEW a:3hashar [21:41:54] (03PS5) 10Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - 10https://gerrit.wikimedia.org/r/249056 [21:43:06] (03CR) 10Dzahn: "@Alex better like this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [21:50:09] (03PS7) 10Dzahn: lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 [21:51:04] (03PS8) 10Dzahn: lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 [21:52:35] (03PS9) 10Dzahn: lint: double quoted strings pt.1 [puppet] - 10https://gerrit.wikimedia.org/r/243852 [21:54:15] (03PS1) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) [21:55:01] (03CR) 10jenkins-bot: [V: 04-1] webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: 10JanZerebecki) [21:57:29] (03PS1) 10BBlack: cipher_sim: fix match ordering (server pref first) [puppet] - 10https://gerrit.wikimedia.org/r/249577 [21:57:51] (03CR) 10BBlack: [C: 032 V: 032] cipher_sim: fix match ordering (server pref first) [puppet] - 10https://gerrit.wikimedia.org/r/249577 (owner: 10BBlack) [21:58:03] (03PS4) 10MaxSem: Beta: use final agreed upon deployment scheme [puppet] - 10https://gerrit.wikimedia.org/r/248374 [21:58:11] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:00:01] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [22:03:30] (03PS2) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) [22:09:48] 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1763832 (10mmodell) #scap3 is going to provide a multi-paned terminal layout which monitors logstash in the same terminal window that is running scap. 
It will be a tremendous improvement over `fatalmonitor` comm... [22:19:36] (03CR) 10Merlijn van Deen: "I agree it's not that important for noninteractive use. I'm more thinking of the test/deploy/revert cases, where reducing the time with 50" [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: 10Merlijn van Deen) [22:21:09] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:30] PROBLEM - SSH on analytics1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:30] PROBLEM - RAID on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:00] PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:01] PROBLEM - DPKG on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:10] PROBLEM - dhclient process on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:10] PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:10] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:25] (03PS1) 10John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - 10https://gerrit.wikimedia.org/r/249584 [22:22:36] (03PS2) 10John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - 10https://gerrit.wikimedia.org/r/249584 [22:22:36] ottomata: analytics1032? 
[22:22:51] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [22:23:11] RECOVERY - SSH on analytics1032 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [22:23:11] RECOVERY - RAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical [22:23:41] RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up [22:23:49] RECOVERY - DPKG on analytics1032 is OK: All packages OK [22:23:50] RECOVERY - dhclient process on analytics1032 is OK: PROCS OK: 0 processes with command name dhclient [22:23:51] RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full [22:24:00] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [22:28:24] 6operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1763907 (10Krenair) 3NEW [22:33:38] (03PS3) 10John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - 10https://gerrit.wikimedia.org/r/249584 (https://phabricator.wikimedia.org/T116560) [22:34:29] (03CR) 10Rush: [C: 032] mailman: fix syntax in disable_list ban_list echo [puppet] - 10https://gerrit.wikimedia.org/r/249584 (https://phabricator.wikimedia.org/T116560) (owner: 10John F. 
Lewis) [22:35:10] (03PS3) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) [22:36:37] (03CR) 10jenkins-bot: [V: 04-1] webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: 10JanZerebecki) [22:38:20] !log Cassandra cleanup on restbase-test2001-a [22:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:41] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1763953 (10GWicke) @ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic. The topic in turn provides a... [22:42:16] 6operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742#1763985 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [22:50:23] (03PS4) 10JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - 10https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) [22:51:49] PROBLEM - salt-minion processes on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:57:25] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/CirrusSearch/includes/Hooks.php: I0e5f2d3b2 (duration: 00m 18s) [22:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:59:44] ori: https://gerrit.wikimedia.org/r/#/c/249197/ [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2300). [23:00:05] ebernhardson jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [23:00:30] busy with uni stuff at the moment, can someone else take this? [23:00:44] yea i can [23:01:13] \o [23:01:28] jdlrobson: your two patches, one is abandoned and the other has a -1 [23:01:43] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEventsHooks.php: I0e5f2d3b2 (duration: 00m 18s) [23:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:28] jdlrobson: it also makes sense to me to apply that change from mediawiki-config, its the standard way. but up to you [23:02:45] ebernhardson: oops lemme find the merged one [23:03:44] MaxSem: you -2'd your patch, presumably to prevent merging? [23:03:46] ebernhardson: so the first one should be swatted and for the -1 yeh a config change would be better [23:04:11] ebernhardson, removed [23:04:12] jdlrobson: so i should un-abandon the patch? [23:04:20] ebernhardson: but it's out of date https://gerrit.wikimedia.org/r/#/c/249585/1/tests/browser/features/support/pages/article_page.rb [23:04:29] i swatted too early.. :-S [23:04:35] jdlrobson: ok, that makes sense now :) [23:04:44] ebernhardson: https://gerrit.wikimedia.org/r/#/c/249579/ is the one that needs to be swatted [23:05:51] (03CR) 10EBernhardson: [C: 032] Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [23:05:59] (03Merged) 10jenkins-bot: Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem) [23:06:22] ori: done on tin? [23:06:24] ebernhardson: want me to prepare the config change? 
[23:06:29] yes [23:06:29] jdlrobson: please [23:08:19] !log ebernhardson@tin Synchronized portals: Switch www portals to be deployed from Git, but not being served from anywhere yet (duration: 00m 18s) [23:08:21] MaxSem: ^^ [23:08:22] (03PS1) 10Jdlrobson: Enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) [23:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:21] jdlrobson: which patch should be deployed first? [23:10:04] jdlrobson: also, there might be something wrong with this commit message: https://gerrit.wikimedia.org/r/#/c/249603/ [23:10:44] ebernhardson, lgtm, thanks [23:11:13] (03CR) 10Florianschmidtwelzow: [C: 04-1] Enable Wikidata descriptions (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: 10Jdlrobson) [23:11:17] ebernhardson: MobileFrontned one [23:11:32] ebernhardson: and wooaah on that commit msg wtf [23:12:40] (03PS2) 10Jdlrobson: Enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) [23:12:44] fwiw cirrus also typically just sets variables to specific values if its the same everywhere [23:12:57] (in our case, things like list of endpoints to talk to, etc.) [23:13:20] ebernhardson: not even sure how that happened [23:14:09] ebernhardson: preferably config change can be deployed after i verify the MobileFrontend patch is working [23:14:12] is that okay? [23:14:20] jdlrobson: yea, just waiting on jenkins [23:14:28] oh its done [23:15:54] !log ebernhardson@tin Synchronized php-1.27.0-wmf.4/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/249585 (duration: 00m 19s) [23:15:58] jdlrobson: ^^ [23:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:17] ebernhardson: sweet thx [23:16:43] ebernhardson:mmm... 
not seeing it in production yet. guess i need to wait 5 mins [23:16:50] for javascript, yea usually [23:17:04] ebernhardson: works [23:17:05] boom! [23:17:07] thanx [23:17:12] so the config change can now be applied :) [23:17:32] (03CR) 10EBernhardson: [C: 032] Enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: 10Jdlrobson) [23:17:39] (03Merged) 10jenkins-bot: Enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: 10Jdlrobson) [23:18:45] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 17s) [23:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:14] !log ebernhardson@tin Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 18s) [23:19:18] (03CR) 10Dzahn: [C: 031] "it looks reasonable since they are the same permissions as for the other service and these services belong together. i think we have to tr" [puppet] - 10https://gerrit.wikimedia.org/r/249501 (owner: 10Yurik) [23:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:50] ebernhardson: ori and I would look to do https://gerrit.wikimedia.org/r/#/c/249605/ soonish, let me know when thats ok [23:19:51] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 17s) [23:19:52] jdlrobson: ok should be out now [23:20:11] (03CR) 10EBernhardson: [C: 032] Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [23:20:17] ebernhardson: not seeing it right now.. 
[23:20:35] (Merged) jenkins-bot: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - https://gerrit.wikimedia.org/r/243744 (owner: EBernhardson)
[23:21:18] ebernhardson: boom!
[23:21:18] working
[23:21:23] thanks a bunch. you're a superstar
[23:21:25] that's all from me! :)
[23:21:26] (CR) Yurik: "Dzahn, the tilerator was one service and was simply broken apart to simplify management. Its the same code repo. When i was making the pat" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:21:43] (CR) Dzahn: "@chasemp: i think we should add this one to the next meeting too, with the access requests. do you think?" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:21:52] AaronSchulz: sure done soon
[23:22:07] (CR) Rush: "task association?" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:22:44] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: (no message) (duration: 00m 17s)
[23:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:58] (PS2) Yurik: Allow same perms to tileratorui as tilerator [puppet] - https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914)
[23:25:41] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: (no message) (duration: 00m 17s)
[23:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:59] (PS10) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[23:29:19] !log ebernhardson@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249642 (duration: 00m 17s)
[23:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:29:29] AaronSchulz: ok should be done now
[23:30:26] (PS11) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[23:30:48] (CR) Dzahn: [C: 2] lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[23:31:05] ebernhardson: thanks
[23:31:16] !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/Translate/tag/PageTranslationHooks.php: I0e5f2d3b2 (duration: 00m 18s)
[23:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:37] operations, Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1764315 (chasemp)
[23:34:43] operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1764316 (Dzahn) I agree. I think we should: - create a dedicated ganeti VM for people.wm and move it there - let all shell users have access to that ("comes with bastio...
[23:34:54] operations, Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (chasemp) >>! In T104967#1761540, @fgiunchedi wrote: > afaict our puppet hooks for jessie does include `thirdparty` > > ``` > pac...
[23:37:06] JohnFLewis: i don't even see the fix in the diff :) ban_list?
[23:37:14] cool though
[23:38:57] ebernhardson: not seeing the move error manifest anymore \o/
[23:39:01] oh trailing ' .. that took me a while
[23:39:33] AaronSchulz: excellent, it would be nice to someday remove all those php4 vestiges, some day :)
[23:40:09] the problem with the user arg still needs investigating
[23:40:29] some sort of esoteric bug going on in there
[23:43:25] AaronSchulz: should probably merge https://gerrit.wikimedia.org/r/#/c/249644/ back then?
[23:43:37] (CR) Dzahn: "yea, but if you don't load the compat module (and if it disappears in a future release) then this would break and also working in 2.2 migh" [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[23:44:29] ah ok :)
[23:52:06] (PS1) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[23:52:57] (CR) jenkins-bot: [V: -1] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:54:27] (CR) Dzahn: "hmm. there's 'No file(s) found for import of '../../../manifests/nagios.pp'" again" [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:54:49] (PS2) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[23:55:24] (CR) jenkins-bot: [V: -1] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:58:58] (PS1) Dzahn: etc,redis,dynamicproxy: fix some lint warnings [puppet] - https://gerrit.wikimedia.org/r/249658