[00:00:39] (03CR) 10Dzahn: [C: 032] ocg: install the right font packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [00:01:31] * MaxSem is done [00:03:31] (03CR) 10Dzahn: "no-op on ocg1001/1002, fixed issues on ocg1003" [puppet] - 10https://gerrit.wikimedia.org/r/288132 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [00:08:02] (03CR) 10Thcipriani: [C: 031] "working well in beta + email announce sent." [puppet] - 10https://gerrit.wikimedia.org/r/287918 (owner: 10Filippo Giunchedi) [00:17:42] (03PS2) 10Alex Monk: Try to separate trebuchet stuff from role::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/284851 [00:21:50] (03PS1) 10Dzahn: ocg: don't try to use syslog group on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288139 (https://phabricator.wikimedia.org/T84723) [00:24:09] (03PS2) 10Dzahn: ocg: don't try to use syslog group on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288139 (https://phabricator.wikimedia.org/T84723) [00:26:50] (03PS3) 10Dzahn: ocg: don't try to use syslog group on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288139 (https://phabricator.wikimedia.org/T84723) [00:27:38] (03CR) 10Dzahn: [C: 032] ocg: don't try to use syslog group on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288139 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [00:31:20] (03CR) 10Dzahn: "no-op on ocg1001/1002, fixed issue on ocg1003" [puppet] - 10https://gerrit.wikimedia.org/r/288139 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [00:36:32] (03PS1) 10Dzahn: ocg: use 'adm' group for log dir when on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288141 [00:38:49] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures [00:40:33] (03CR) 10Dzahn: [C: 032] ocg: use 'adm' group for log dir when on systemd [puppet] - 10https://gerrit.wikimedia.org/r/288141 (owner: 10Dzahn) [00:54:52] (03PS1) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:00:29] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures [01:02:14] (03PS2) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:04:14] (03PS3) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:04:49] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:05:41] (03PS4) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:09:38] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2283661 (10GWicke) [01:09:57] (03PS5) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:12:58] (03PS6) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:14:03] (03PS7) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 
(https://phabricator.wikimedia.org/T84723) [01:15:34] (03PS8) 10Dzahn: ocg: set correct ImageMagick conf dir on jessie [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) [01:17:22] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/2748/" [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [01:19:51] (03CR) 10Dzahn: "no-op on ocg1001/1002, /Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running' on ocg1003" [puppet] - 10https://gerrit.wikimedia.org/r/288142 (https://phabricator.wikimedia.org/T84723) (owner: 10Dzahn) [01:19:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 707 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6157136 keys - replication_delay is 707 [01:20:23] 06Operations, 10OCG-General, 13Patch-For-Review: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2276503 (10Dzahn) done for ocg on jessie on ocg1003 with https://gerrit.wikimedia.org/r/#/c/288142/ [01:21:17] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2283669 (10Dzahn) NOW: "Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running' We still see some failures in puppet output related to apparmor but this part is fixed :) [01:21:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6148986 keys - replication_delay is 0 [01:23:41] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2283672 (10Dzahn) We are now down to this remaining issue: Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register. The config exists... [01:26:18] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:39:28] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [01:42:41] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2283676 (10Dzahn) i added "apparmor=1 security=apparmor" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and rebooted the server. issue is fixed. @Joe @cscott see all the comments above and no... 
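Dzahn's last comment above describes the fix: adding `apparmor=1 security=apparmor` to `GRUB_CMDLINE_LINUX_DEFAULT` and rebooting. A minimal sketch of that change, assuming a stock Debian /etc/default/grub (the sed invocation and the aa-status verification step are illustrative, not taken from the log):
```
# Append the AppArmor parameters inside the existing quoted value, then
# regenerate the grub config; the change only takes effect after a reboot.
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"/\1 apparmor=1 security=apparmor"/' /etc/default/grub
sudo update-grub
sudo reboot
# After the reboot, confirm the LSM is active:
sudo aa-status
```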
[01:45:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150616 keys - replication_delay is 617 [01:49:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150362 keys - replication_delay is 0 [02:15:06] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 1 failures [02:18:26] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [02:20:26] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 0.032 second response time [02:24:47] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 09m 39s) [02:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:55] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283687 (10BBlack) @Yurik - ok. FYI, we separately cap 4xx lifetimes anyways. Currently at 1 minute on all clusters, although IMHO we could probabl... [02:35:45] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283690 (10Yurik) Thanks @bblack. Are there any headers I should add to the regular 200 replies beyond setting maxage? [02:36:03] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283693 (10Danny_B) For cross reference: - {T96848} - {T134817} [02:40:04] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Danny_B) [02:41:21] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [02:42:12] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2283715 (10BBlack) Well, that's a complex topic, but what you're doing now will work fine for now :) Someday we'll have a guideline document to poin... [02:46:08] !log mwdeploy@tin sync-l10n completed (1.28.0-wmf.1) (duration: 05m 54s) [02:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:55:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 11 02:55:55 UTC 2016 (duration 9m 47s) [02:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:18] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10BBlack) I'm not sure we need a whole separate tag for HTTP/2. [03:10:08] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2283745 (10BBlack) While there's a possibility this is an HTTP/2 issue, it's also possible that's a number of other things, including some generic bug in Firefox that's not protocol specific. [03:17:33] (03CR) 10KartikMistry: "That works for me. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [03:29:37] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283697 (10Krenair) I'm sceptical as well.
What existing tasks would go into this project? [03:35:21] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail [03:57:40] 06Operations, 06Project-Admins: Create #IRCecho project - https://phabricator.wikimedia.org/T134961#2283779 (10Danny_B) [04:01:11] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [05:17:22] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2283861 (10Danny_B) I think all new(ly used) technologies deserve a tag since they may be buggy or cause issues (typically compatibility). For these cases we have e.g. #newphp Also, we have #https too... AT... [05:41:12] PROBLEM - Disk space on elastic1031 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80638 MB (15% inode=99%) [05:56:42] RECOVERY - Disk space on elastic1031 is OK: DISK OK [06:06:39] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2283894 (10mobrovac) >>! In T134251#2281939, @JanZerebecki wrote: > Directed to you. You implied that this request is not related to any service but only to hosts. Why deviate from the norma... [06:23:11] (03PS2) 10Jcrespo: [WIP]New user for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) [06:24:09] (03CR) 10Jcrespo: "I have fixed the typo, but are you already using $passwords::prometheus::db_pass or equivalent somewhere ?" [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [06:24:28] (03PS1) 10Hoo man: Add a role description to snapshot::cron::wikidatadumps::json [puppet] - 10https://gerrit.wikimedia.org/r/288150 [06:30:23] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:52] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] (03CR) 10Yuvipanda: [C: 04-1] "From my understanding of how CSRF issues work, this is basically an open invitation for those. Any malicious user on *any* page can perfor" [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [06:31:52] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:03] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:12] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:22] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:33] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:33] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:42] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:52] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:43] PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: puppet fail [06:35:20] (03CR) 10Yuvipanda: "(By 'any page' I mean any page in the whole wide internet.
You wouldn't expect clicking on a random link to be able to destroy your labeli" [puppet] - 10https://gerrit.wikimedia.org/r/287570 (owner: 10Ladsgroup) [06:39:31] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Everyone is complaining already that our hiera setup is too complex (with some merit), I would not go into the direction of making it more" [puppet] - 10https://gerrit.wikimedia.org/r/288106 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [06:41:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 1.0.2h [debs/openssl] - 10https://gerrit.wikimedia.org/r/286666 (owner: 10Muehlenhoff) [06:42:52] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [06:49:04] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 73.05 ms [06:56:23] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:43] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:53] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:23] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:57:23] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:33] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:41] (03PS1) 10Muehlenhoff: Cherrypick 8358b02bf67d3a5d8a825070e1aa73f25fb2e4c7 to address CVE-2016-4557 [debs/linux44] - 10https://gerrit.wikimedia.org/r/288151 [07:00:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Cherrypick 8358b02bf67d3a5d8a825070e1aa73f25fb2e4c7 to address CVE-2016-4557 [debs/linux44] - 10https://gerrit.wikimedia.org/r/288151 (owner: 10Muehlenhoff) [07:00:53] RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:29] (03PS1) 10Muehlenhoff: Remove access credentials for kleduc [puppet] - 10https://gerrit.wikimedia.org/r/288153 [07:13:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "the code is ok now, but two things need to be done:" (032 comments) [software/kubernetes] - 10https://gerrit.wikimedia.org/r/287572 (owner: 10Yuvipanda) [07:14:49] (03CR) 10Muehlenhoff: [C: 04-2] "Only merge after the 13th" [puppet] - 10https://gerrit.wikimedia.org/r/288153 (owner: 10Muehlenhoff) [07:15:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add ferm rules for pybal_conf / http [puppet] - 10https://gerrit.wikimedia.org/r/283175 (owner: 10Muehlenhoff) [07:19:34] (03CR) 10Giuseppe Lavagetto: [C: 031] Add the possibility to specify memcached's chunk growth factor. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [07:20:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [07:21:05] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [07:33:36] (03CR) 10Elukey: Add the possibility to specify memcached's chunk growth factor. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [07:35:27] (03PS4) 10Elukey: Add the possibility to specify memcached's chunk growth factor. [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [07:37:21] (03PS5) 10Elukey: Add the possibility to specify memcached's chunk growth factor. [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [07:41:05] _joe_ thanks for the review, should be good now. Even if the puppet compiler seems happy, I'd like to disable puppet on mc10XX and then merge, running the puppet agent on two random hosts before re-enabling. It should be a no-op but I don't want to wipe all memcached slabs :) [07:41:29] (no op except for mc1009) [07:45:08] (03PS1) 10Ema: git.w.o VTC tests: add XFP [puppet] - 10https://gerrit.wikimedia.org/r/288154 [07:45:15] <_joe_> ok [07:47:32] (03CR) 10Ema: [C: 032 V: 032] git.w.o VTC tests: add XFP [puppet] - 10https://gerrit.wikimedia.org/r/288154 (owner: 10Ema) [08:00:12] 06Operations, 10OCG-General, 13Patch-For-Review: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie - https://phabricator.wikimedia.org/T134773#2284021 (10hashar) 05Open>03Resolved OCG servers are being reinstalled apparently to Jessie (T84723) @Dzahn fixed the apparm... [08:02:05] good morning [08:03:22] !log puppet disabled on mc10XX hosts for https://gerrit.wikimedia.org/r/#/c/287913 [08:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:03:41] hashar: o/ [08:04:12] !log restarting apache on palladium for openssl update [08:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:04:53] (03PS6) 10Elukey: Add the possibility to specify memcached's chunk growth factor. [puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) [08:06:10] (03CR) 10Elukey: [C: 032] Add the possibility to specify memcached's chunk growth factor.
[puppet] - 10https://gerrit.wikimedia.org/r/287913 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [08:08:24] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: puppet fail [08:08:24] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: puppet fail [08:08:25] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: puppet fail [08:08:34] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: puppet fail [08:08:53] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: Puppet has 13 failures [08:09:03] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: puppet fail [08:09:13] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 12 failures [08:09:23] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: puppet fail [08:09:33] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 11 failures [08:09:34] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: puppet fail [08:09:34] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: puppet fail [08:09:38] oops [08:09:43] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 12 failures [08:09:43] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: Puppet has 9 failures [08:09:43] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Puppet has 11 failures [08:09:50] elukey: puppet unhappy :-} [08:09:53] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 1 failures [08:09:53] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: Puppet has 8 failures [08:10:03] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 2 failures [08:10:03] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: puppet fail [08:10:04] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail [08:10:04] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 13 failures [08:10:14] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 20 failures [08:10:23] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 6 failures [08:10:34] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Puppet has 1 failures [08:10:43] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 38 failures [08:10:44] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Puppet has 1 failures [08:10:44] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 15 failures [08:10:47] hashar: I just ran it on mc1010 and all was good [08:10:53] hashar: I guess it is more for the apache restart of moritzm [08:10:54] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: puppet fail [08:10:55] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 8 failures [08:10:58] on palladium [08:11:04] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [08:11:13] PROBLEM - puppet last run on mw2040 is CRITICAL: CRITICAL: Puppet has 10 failures [08:11:14] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures [08:11:31] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2284116 (10ema) @elukey: the ganglia/varnish issue we've seen when upgrading cp1044 was due to VSM files not being readable by the ganglia user since the v4 upgrade. The solution is simple: add the ganglia user...
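The change elukey merges above (r287913) exposes memcached's chunk size growth factor; a SAL entry a few minutes later shows mc1009 restarted with 1.15, up from 1.05. A quick local sketch of what the knob does — standard memcached flags, not the production invocation:
```
# -f sets the growth factor between successive slab classes (upstream default
# is 1.25; smaller values give finer-grained chunk sizes). -vv makes memcached
# print the resulting slab classes at startup, so the effect of -f is visible:
memcached -u memcache -m 64 -f 1.15 -vv 2>&1 | head -n 25
```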
[08:11:44] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 1 failures [08:11:45] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [08:11:45] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [08:12:07] volans: yeah that explains it more thanks :) [08:14:27] elukey: are the memcached in prod being migrated to Jessie ? [08:14:41] Is this just the puppet failures that happen every day? [08:14:41] hashar: already on Jessie :) [08:14:44] * YuviPanda clock is out of whack [08:14:46] asking cause the beta cluster instances are still using Precise [08:14:52] with memcached 1.4.15-0wmf1 [08:15:24] hashar: we should probably upgrade them, but I am not sure if there is a plan or not [08:15:25] <_joe_> YuviPanda: no I think it's moritzm restarting apache on palladium [08:15:31] aaah ok [08:15:35] that makes sense [08:15:46] <_joe_> hashar: the memcached in prod are jessies [08:17:13] yeah, that's puppet agents which tried to connect to the puppetmaster while it was restarted, should all sort out within 30 minutes max. [08:18:03] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [08:18:04] I forced it on db2040 and went fine [08:18:18] thanks icinga :) [08:18:35] elukey: filed https://phabricator.wikimedia.org/T134974 I am going to do it right now. Should be quite fast :-} [08:18:41] !log memcached on mc1009 restarted with chunk size growth factor 1.15 (was: 1.05) [08:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:17] \o/ [08:21:43] given labs spawn instance and that everything is in puppet class role::memcached, that is pretty much straightforward [08:24:26] !log restbase deploy start of beaaa71 [08:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:28:32] !log bootstrap restbase2008-a T132976 [08:28:32] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:29:07] godog: \o/ [08:29:19] hehe \o/ indeed [08:34:21] !log restbase deploy end of beaaa71 [08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:34:30] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:34:50] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:34:50] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:34:59] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:35:09] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [08:35:10] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [08:35:19] RECOVERY - puppet last run on mw2040 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:35:19] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:35:20] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:35:20] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet
is currently enabled, last run 2 seconds ago with 0 failures [08:35:29] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:35:30] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:35:30] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:35:39] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:36:10] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [08:36:11] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:36:11] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:36:11] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:36:19] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:36:20] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:36:20] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:36:30] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:36:49] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:36:50] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:37:10] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:37:20] PROBLEM - Restbase root url on restbase2008 is CRITICAL: Connection refused [08:37:20] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:37:29] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [08:37:30] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:37:50] PROBLEM - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: Connection refused [08:37:51] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:37:52] (03CR) 10Alexandros Kosiaris: [C: 032] Change Prop: Tell RESTBase not to respond with redirects [puppet] - 10https://gerrit.wikimedia.org/r/287080 (https://phabricator.wikimedia.org/T134483) (owner: 10Mobrovac) [08:37:57] (03PS2) 10Alexandros Kosiaris: Change Prop: Tell RESTBase not to respond with redirects [puppet] - 10https://gerrit.wikimedia.org/r/287080 (https://phabricator.wikimedia.org/T134483) (owner: 10Mobrovac) [08:38:02] (03CR) 10Alexandros Kosiaris: [V: 032] Change Prop: Tell RESTBase not to respond with redirects [puppet] - 10https://gerrit.wikimedia.org/r/287080 (https://phabricator.wikimedia.org/T134483) (owner: 10Mobrovac) [08:38:20] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:38:20] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [08:38:20] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.142, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:38:29] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:38:30] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:38:45] godog: i'll deploy to rb2008, no need to ack [08:39:16] ok thanks mobrovac ! [08:39:20] np [08:39:51] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping cassandra [08:42:30] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [08:43:19] RECOVERY - Restbase root url on restbase2008 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.148 second response time [08:48:48] (03CR) 10Hashar: [C: 031] "I have cherry picked that on the beta cluster puppet master. Force ran puppet on all instances having nutcracker installed which properly " [puppet] - 10https://gerrit.wikimedia.org/r/288156 (https://phabricator.wikimedia.org/T134974) (owner: 10Hashar) [08:49:04] elukey: I have switched beta cluster memcached to Jessie. https://gerrit.wikimedia.org/r/#/c/288156/ is good to merge :-} [08:50:18] (03CR) 10Elukey: [C: 032] beta: migrate memcached to new Jessie servers [puppet] - 10https://gerrit.wikimedia.org/r/288156 (https://phabricator.wikimedia.org/T134974) (owner: 10Hashar) [08:50:34] hashar: good for me, merging :) [08:50:38] thanks! [08:50:51] hashar, do you think it is ok for me to create a ticket for releng to triage, even if in reality it should be done by others (I just do not know who, but potentially created because of train)?
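godog's acknowledgement above notes that restbase2008-a is still bootstrapping, which is why its CQL port refuses connections. Two standard commands for watching a joining Cassandra node (plain nodetool/cqlsh usage; any instance-specific flags on the restbase hosts are assumptions):
```
# A joining node is listed as UJ in nodetool status and flips to UN once
# bootstrap streaming completes:
nodetool status | grep -E '^U[JN]'
# The CQL port (9042) only answers after the node has joined, hence the
# "Connection refused" from the icinga check during bootstrap:
cqlsh 10.192.32.143 9042 -e 'SHOW VERSION'
```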
puppet role::memcached combined with hiera makes it magic [08:51:03] and all MediaWiki config referencing the local nutcracker instances [08:51:08] all of that made it very trivial [08:51:18] jynus: yeah definitely [08:51:27] (03CR) 10Elukey: [V: 032] beta: migrate memcached to new Jessie servers [puppet] - 10https://gerrit.wikimedia.org/r/288156 (https://phabricator.wikimedia.org/T134974) (owner: 10Hashar) [08:51:41] jynus: and if there are some error logs you can add the project tag #wikimedia-log-errors to it [08:51:46] I just did not want it to look like I "blamed" you [08:52:02] thanks, that is a really good idea that I can actually do [08:52:10] (03PS1) 10Ema: Upgrade misc esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288157 (https://phabricator.wikimedia.org/T131501) [08:52:28] ema: \o/ [08:52:34] I don't think there is a point in having our DBAs waste their time figuring out who should own a task when releng can pretty much be abused to assume the triaging / reach devs etc ;-} [08:55:19] !log upgrading cp3008 to varnish 4 (T131501) [08:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:35] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [08:56:27] (03PS2) 10Ema: Upgrade misc esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288157 (https://phabricator.wikimedia.org/T131501) [08:56:35] (03CR) 10Ema: [C: 032 V: 032] Upgrade misc esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288157 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [09:00:51] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 2 failures [09:01:00] (03CR) 10Filippo Giunchedi: [C: 031] "no I'm not using the password anywhere yet, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [09:01:11] (03PS1) 10Muehlenhoff: Assign salt grains for various labtest* hosts for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/288159 [09:02:37] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [09:02:50] (03PS2) 10Muehlenhoff: Assign salt grains for various labtest* hosts for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/288159 [09:03:37] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:03:55] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:03:55] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:03:56] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:05] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:05] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:05:16] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86282.68 seconds [09:05:25] ? [09:05:31] overload [09:05:32] ?
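Returning to the beta memcached swap above: as hashar notes, MediaWiki only ever talks to a local nutcracker (twemproxy) instance, so moving the backends is a pure config change. A sanity check after regenerating the config (the path is the usual Debian location, an assumption here):
```
# -t/--test-conf parses the configuration and exits; a non-zero exit means a
# syntax error, so this is safe to run before restarting the daemon:
nutcracker -t -c /etc/nutcracker/nutcracker.yml
```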
[09:05:37] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:05:37] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:05:46] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:05:55] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:05:55] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 85988.58 seconds [09:05:55] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:06:10] 06Operations: ircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2284223 (10fgiunchedi) thanks @Dzahn @Krenair ! I think ircecho should react better to exceptions, generally `except Exception` is a bad sign in python, possibly just log the exception and exit a... [09:06:21] jynus: pigz running :) [09:06:23] wait [09:06:31] today is wednesday [09:06:31] !log upgrading cp3009 to varnish 4 (T131501) [09:06:32] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [09:06:35] I always forget [09:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:12] I have to also check the new misc script [09:10:14] Looks good: https://phabricator.wikimedia.org/P3031 [09:10:44] one day dbstore1001 will explode [09:10:52] :-) [09:11:22] in other news, ones that are not horrible or boring [09:11:35] GTID enabled on es2019 [09:11:52] which hopefully will mean no more corruption on crash [09:12:12] I want to crash it now [09:12:28] but I think I will choose a node that does not take 8 hours to recover [09:12:33] in case I am wrong [09:12:37] eheheh [09:12:59] but if it works, it will increase our quality of live 100 times [09:13:18] s/v/f/ [09:14:39] !log upgrading cp3010 to varnish 4 (T131501) [09:14:40] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for various labtest* hosts for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/288159 (owner: 10Muehlenhoff) [09:18:33] (03PS1) 10Filippo Giunchedi: graphite: deprecate 'carbonctl check' [puppet] - 10https://gerrit.wikimedia.org/r/288161 [09:19:18] 06Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2284244 (10akosiaris) [09:20:37] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 2 failures [09:22:36] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:22:57] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:05] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:23:06] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
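jynus's note above ("GTID enabled on es2019") refers to MariaDB-style global transaction IDs. A sketch of what flipping an existing slave over looks like (standard MariaDB syntax; es2019's actual configuration is not shown in the log):
```
# Switch from binlog-file/position to GTID-based replication. MariaDB keeps
# the slave's GTID position in the transactional mysql.gtid_slave_pos table,
# which is what should leave the replication state consistent after a crash
# instead of corrupted:
mysql -e "STOP SLAVE; CHANGE MASTER TO master_use_gtid = slave_pos; START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G" | grep -i gtid
```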
[09:24:24] I will ack those so they do not spam again [09:24:45] 06Operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#2284248 (10akosiaris) 05stalled>03Open p:05Normal>03Low We've actually replaced quite a few misc servers already with VMs, including all the candidates liste... [09:24:55] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [09:24:56] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [09:24:56] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:25:00] but I think load was bad, but not so bad that nrpe timed out [09:25:06] something changed? [09:27:59] disk usage is at 100%... [09:29:27] let's compare it with last week [09:29:57] was kinda the same, network traffic was on the other side though [09:30:23] no something's wrong [09:30:42] previous backups were not deleted [09:31:13] this backup started earlier than usual [09:31:38] the spike in load/disk is the same as on May 6th, not one week ago [09:31:45] or maybe last one started late [09:32:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:16] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:16] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:16] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:42] jynus: for https://phabricator.wikimedia.org/T134976 I guess I lack a lot of knowledge about how db works :/ [09:32:55] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:56] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:05] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:16] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:26] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:26] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:36] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:37] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:49] there is surely a bunch of SlowTimer entries reported by hhvm and I have no idea how much load SpecialRecentChangesLinked::doMainQuery put on the db [09:33:55] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:55] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:55] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:56] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:34:16] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:34:16] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
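A few commands matching the triage above (the mount point and backup path are assumptions; the log doesn't show dbstore1001's actual layout):
```
df -h /srv                    # "disk usage is at 100%"
ls -lht /srv/backups | head   # were the previous backups really not deleted?
iostat -x 5 3                 # is the disk saturated by the running dump?
```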
[09:34:56] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:34:56] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:35:06] jynus: on dbstore1001 show slave status\G hangs, hence the timeout from NRPE [09:35:58] it should only hang because of long-running transactions [09:36:15] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:36:15] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:36:15] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:36:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88100.48 seconds [09:36:16] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:36:16] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:36:25] I've downtimed it for now [09:36:43] ok [09:36:46] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84344.37 seconds [09:36:46] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84083.92 seconds [09:36:47] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:36:55] (03PS1) 10Ema: Upgrade misc ulsfo to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288166 (https://phabricator.wikimedia.org/T131501) [09:36:58] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:36:58] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:37:16] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [09:37:25] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:37:25] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [09:37:35] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 2.06 seconds [09:37:36] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:37:37] !log rolling restart of scb in eqiad to pick up openssl update [09:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:37:46] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 75548.89 seconds [09:37:55] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:37:56] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86299.05 seconds [09:37:56] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [09:38:10] I am checking alert history [09:38:41] (03CR) 10Ema: [C: 032 V: 032] Upgrade misc ulsfo to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288166 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [09:39:11] it happened before, just at night [09:39:52] I think the script would benefit from throttling [09:41:09] !log upgrading cp4001 to varnish 4 (T131501) [09:41:10] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [09:41:16] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:27] T134471 and T131705, plus the unusual time may contribute to higher load [09:46:28] T131705: dbstore1001 low available space - https://phabricator.wikimedia.org/T131705 [09:46:28] T134471: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471 [09:49:44] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2284262 (10Krenair) @fgiunchedi: kraz runs udpmxircecho, not ircecho [09:49:51] surely doesn't help [09:50:25] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2284264 (10Krenair) [09:50:38] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2280451 (10Krenair) [09:50:41] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2284266 (10Krenair) [09:51:37] 06Operations, 10DBA, 07Availability: Throttle mysql backups on dbstore1001 in order to not saturate the node - https://phabricator.wikimedia.org/T134977#2284269 (10jcrespo) [09:52:14] 06Operations, 10DBA, 07Availability: Throttle mysql backups on dbstore1001 in order to not saturate the node - https://phabricator.wikimedia.org/T134977#2284281 (10jcrespo) p:05Triage>03Low [09:52:54] 06Operations, 10DBA: Throttle mysql backups on dbstore1001 in order to not saturate the node - https://phabricator.wikimedia.org/T134977#2284269 (10jcrespo) [10:03:56] !log upgrading cp4002 to varnish 4 (T131501) [10:03:56] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [10:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:11] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284310 (10Samtar) Hi @BBlack thanks again for the above - I did remember to turn this off, not keen to be surfing around without it! At work (Win 8.1 Pro + FF 46.0.1) the issue reappeared until changing to... [10:05:29] (03PS5) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) [10:10:19] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2284337 (10fgiunchedi) @Krenair I don't know what the difference is, though the service is called ircecho and that spawns udpmxircecho ``` kraz:~$ grep -i exec /etc/systemd/system/ircecho.service ExecStart... [10:12:12] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 699 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6179090 keys - replication_delay is 699 [10:12:34] !log upgrading cp4003 to varnish 4 (T131501) [10:12:34] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [10:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:17:22] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:45] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2284353 (10Krenair) They're completely separate scripts reading data to broadcast in different ways and are set up to send to different networks. There's also tcpircbot. ```alex@alex-laptop:~/Development/Wi... 
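For the throttling task filed above (T134977), one hedged sketch of the idea (the command line is hypothetical; the real backup script's invocation is not in the log):
```
# Run the dump at idle I/O and lowest CPU priority and cap pigz's worker
# threads, so mysqld and the NRPE checks are not starved while a backup runs:
ionice -c3 nice -n19 \
  mysqldump --single-transaction --quick enwiki \
  | pigz -p2 > /srv/backups/enwiki.sql.gz   # database name and path illustrative
```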
[10:19:12] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 68142 bytes in 0.124 second response time [10:19:27] Krenair: haahha /o\ re: T95052 [10:19:27] T95052: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052 [10:20:10] godog, yeah we have three different IRC bots in puppet :( [10:20:17] !log upgrading cp4004 to varnish 4 (T131501) [10:20:18] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [10:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:45] <_joe_> Krenair: only 3? [10:21:52] <_joe_> and all invented here, I hope [10:22:00] 3 that I am aware of. [10:22:29] <_joe_> yeah I was just spewing my usual dose of sarcasm around :P [10:22:57] Hey there's probably more in frack or officeit or something [10:26:43] (03CR) 10Volans: "Sample of compiler: https://puppet-compiler.wmflabs.org/2753/db2068.codfw.wmnet/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285679 (https://phabricator.wikimedia.org/T133780) (owner: 10Volans) [10:27:03] (03PS1) 10Filippo Giunchedi: graphite: run graphite-index every five minutes [puppet] - 10https://gerrit.wikimedia.org/r/288169 [10:27:49] (03PS2) 10Hashar: contint: move File[/srv/localhost-worker] out of role [puppet] - 10https://gerrit.wikimedia.org/r/286869 [10:31:14] (03PS3) 10Hashar: contint: drop libcurl4-gnutls-dev [puppet] - 10https://gerrit.wikimedia.org/r/286837 (https://phabricator.wikimedia.org/T134378) [10:31:16] (03PS4) 10Hashar: contint: move package_builder setup to its own class [puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) [10:31:18] (03PS2) 10Hashar: contint: regroup PHP definitions in contint::packages::php [puppet] - 10https://gerrit.wikimedia.org/r/286879 [10:34:53] (03PS2) 10Filippo Giunchedi: graphite: run graphite-index every five minutes [puppet] - 10https://gerrit.wikimedia.org/r/288169 [10:35:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: run graphite-index every five minutes [puppet] - 10https://gerrit.wikimedia.org/r/288169 (owner: 10Filippo Giunchedi) [10:37:13] (03CR) 10Jcrespo: [C: 031] Avoid loading my.cnf twice [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285679 (https://phabricator.wikimedia.org/T133780) (owner: 10Volans) [10:38:03] (03CR) 10Jcrespo: [C: 031] MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) (owner: 10Volans) [10:38:58] (03CR) 10Jcrespo: [C: 031] "Commit this, this is already very useful." 
[software] - 10https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702) (owner: 10Volans) [10:41:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [10:42:02] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [10:42:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:42:54] (03CR) 10Volans: [C: 032] DBtools: add script to check external storage [software] - 10https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702) (owner: 10Volans) [10:43:05] !log elukey@palladium conftool action : set/pooled=no; selector: kafka1001.eqiad.wmnet [10:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:42] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6150537 keys - replication_delay is 0 [10:44:47] !log gradually disabling unprivileged bpf on Linux 4.4 hosts via sysctl (once completed this will be puppetised, but the sysctl can't be reverted without a reboot so be careful for the initial activation) [10:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:57] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka1001.eqiad.wmnet [10:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:48:40] !log restarting eventbus on kafka100[12] for security upgrades [10:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:14] !log elukey@palladium conftool action : set/pooled=no; selector: kafka1002.eqiad.wmnet [10:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:37] (03CR) 10Volans: [V: 032] DBtools: add script to check external storage [software] - 10https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702) (owner: 10Volans) [10:51:58] !log elukey@palladium conftool action : set/pooled=yes; selector: kafka1002.eqiad.wmnet [10:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:33] (03CR) 10Hashar: "recheck" [software] - 10https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702) (owner: 10Volans) [10:59:14] (03CR) 10Hashar: "Zuul now blindly vote V+2 :-)" [software] - 10https://gerrit.wikimedia.org/r/283946 (https://phabricator.wikimedia.org/T130702) (owner: 10Volans) [11:15:51] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2284389 (10JanZerebecki) > Because the concept of SC(A|B) is already a deviation from that: How so? The concept is designed to allow multiple roles/services per host. [11:16:12] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:53] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:17:37] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284392 (10BBlack) #HTTPS is a little different: that transition is a huge, long-term effort spanning years. HTTP/2 is only a slight change from the SPDY/3 that we supported for quite a while before it.... 
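The sysctl moritzm is rolling out above is presumably kernel.unprivileged_bpf_disabled, introduced in Linux 4.4; as the SAL entry warns, it is one-way on a running kernel. A sketch (the sysctl.d file name is an assumption):
```
# Disable the bpf() syscall for unprivileged users; once set to 1 it cannot be
# cleared again without a reboot, exactly as the !log entry cautions:
sudo sysctl -w kernel.unprivileged_bpf_disabled=1
# Persist it for future boots:
echo 'kernel.unprivileged_bpf_disabled = 1' | sudo tee /etc/sysctl.d/99-unprivileged-bpf.conf
```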
[11:18:22] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:18:32] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:18:41] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:18:52] PROBLEM - puppet last run on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:01] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2284394 (10mobrovac) @JanZerebecki Puppet runs on a per-host basis, not per-service. Hence, I'm not asking for a specific service admin group to be able to run it, but I'm asking that I can... [11:19:02] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:13] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:21] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:22] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:22] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:32] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:43] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:03] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:22:33] <_joe_> looking at it [11:23:21] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up [11:23:22] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212 [11:23:22] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:23:31] RECOVERY - DPKG on mw1142 is OK: All packages OK [11:23:41] RECOVERY - Disk space on mw1142 is OK: DISK OK [11:24:01] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm [11:24:22] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [11:24:33] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [11:24:41] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient [11:24:52] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:25:02] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:25:12] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full [11:27:18] 06Operations, 10Traffic, 07HTTPS: Secure connection failed - https://phabricator.wikimedia.org/T134869#2284408 (10BBlack) We may have been seeing the same bug here as https://bugzilla.mozilla.org/show_bug.cgi?id=1271301 - you might want to see if the test in that bug work from FF 47 but not 46 as well. Ther... 
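All of the mw1142 alerts above are 10-second socket timeouts, i.e. the host stops answering in time rather than returning errors. A rough manual equivalent of the rendering check (URL, port and Host header are guesses from the check name, not the real icinga definition):
```
# Mirrors the 10s timeout of the failing checks; a wedged HHVM typically
# accepts the TCP connection but never sends a response:
curl -sS -m 10 -o /dev/null -w '%{http_code} in %{time_total}s\n' \
  -H 'Host: en.wikipedia.org' http://mw1142.eqiad.wmnet/wiki/Special:BlankPage
```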
[11:34:34] <_joe_> !log restarting mw1142 [11:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:50] <_joe_> !log restarting just HHVM on mw1142 [11:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:06] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 68117 bytes in 5.618 second response time [11:36:07] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.061 second response time [11:40:16] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:40:48] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:27] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:27] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:44:58] 06Operations, 10Graph, 10Graphoid, 10Traffic, 13Patch-For-Review: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2284438 (10Yurik) 05Open>03Resolved a:03Yurik ok, will close for now, thx for your help! [11:47:52] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284449 (10Danny_B) OK, another: {T117682} There is bunch of tasks regarding SPDY, so maybe #HTTP2-SPDY then? [11:51:57] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 712 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6155971 keys - replication_delay is 712 [11:52:04] (03Abandoned) 10Gehel: WIP - Allow host specific private configuration [puppet] - 10https://gerrit.wikimedia.org/r/288106 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [11:52:31] (03PS1) 10Ema: Upgrade misc codfw to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288172 (https://phabricator.wikimedia.org/T131501) [11:53:52] (03CR) 10Ema: [C: 032 V: 032] Upgrade misc codfw to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288172 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [11:55:42] !log upgrading cp2006 to varnish 4 (T131501) [11:55:43] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [11:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:33] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284470 (10BBlack) So, recapping the strange evidence above, because I've been thinking about this off and on all night and it still makes no sense: * The only change in... [12:00:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6148622 keys - replication_delay is 0 [12:01:56] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:02:46] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:02:56] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2284474 (10BBlack) SPDY is dead. It was an experimental protocol that predated HTTP/2, and HTTP/2 was derived from it. We dropped support for SPDY when we turned on HTTP/2 last week. Chrome's dropping... [12:03:18] mobrovac: We're set for 15.30 UTC for scap3, right? 
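rdb2006's tcp_6479 check keeps flapping around its threshold; the "712 600" in the alert is the measured delay versus the 600-second limit. A rough sketch of how such a replication-delay check can be derived from INFO replication with the redis-py client (the field used here is an assumption; the production check may compute the delay differently):

```
import redis

WARN_SECONDS = 600  # the limit implied by "replication_delay is 712 600"

r = redis.Redis(host="10.192.48.44", port=6479)
info = r.info("replication")

if info.get("role") == "slave":
    # Seconds since the slave last heard from its master -- a simple
    # proxy for replication delay.
    delay = info.get("master_last_io_seconds_ago", -1)
    status = "CRITICAL" if delay < 0 or delay > WARN_SECONDS else "OK"
    print("%s: replication_delay is %s" % (status, delay))
else:
    print("OK: role is master, nothing to measure")
```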
[12:03:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:03:40] mobrovac: that is also SWAT time, not affecting anything with it? [12:03:58] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:04:58] jouncebot: next [12:04:58] In 2 hour(s) and 55 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T1500) [12:05:06] !log upgrading cp2012 to varnish 4 (T131501) [12:05:07] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:42] thcipriani|afk mobrovac FYI I'm going to merge https://gerrit.wikimedia.org/r/#/c/287918/ in an hour or so [12:09:11] !log installing libarchive security updates [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:01] !log upgrading cp2018 to varnish 4 (T131501) [12:10:02] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:16] PROBLEM - Varnishkafka log producer on cp2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [12:10:45] uh, what's up varnishkafka? [12:10:49] :( [12:12:16] RECOVERY - Varnishkafka log producer on cp2006 is OK: PROCS OK: 1 process with command name varnishkafka [12:17:08] ema: did you restart it? --^ [12:17:17] yes I did [12:17:27] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 2 failures [12:18:28] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284589 (10BBlack) Trying a patch that will test the first two ideas above... [12:18:44] !log upgrading cp2025 to varnish 4 (T131501) [12:18:45] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:19:24] (03PS1) 10BBlack: tlsproxy: protocol 1.1 + ignore_client_abort [puppet] - 10https://gerrit.wikimedia.org/r/288177 (https://phabricator.wikimedia.org/T107749) [12:19:50] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: protocol 1.1 + ignore_client_abort [puppet] - 10https://gerrit.wikimedia.org/r/288177 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [12:25:46] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:16] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:27:06] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:28:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
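bblack's tlsproxy patches above are part of T107749, toggling HTTP/1.1 and keepalive behaviour on the local nginx->varnish hop. As a reminder of what persistent connections buy, a small stdlib demonstration of two requests reusing one TCP connection under HTTP/1.1 (the target host is purely illustrative):

```
import http.client

# HTTP/1.1 defaults to persistent connections: both requests below
# travel over the same TCP connection unless the server closes it.
conn = http.client.HTTPSConnection("en.wikipedia.org")
for path in ("/wiki/HTTP_persistent_connection", "/wiki/Varnish_(software)"):
    conn.request("GET", path, headers={"User-Agent": "keepalive-demo"})
    resp = conn.getresponse()
    resp.read()  # drain the body before reusing the connection
    print(path, resp.status, resp.getheader("Connection"))
conn.close()
```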
[12:30:49] (03PS1) 10BBlack: tlsproxy: remove ignore_client_abort [puppet] - 10https://gerrit.wikimedia.org/r/288179 (https://phabricator.wikimedia.org/T107749) [12:31:06] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: remove ignore_client_abort [puppet] - 10https://gerrit.wikimedia.org/r/288179 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [12:31:50] (03PS1) 10Ema: Upgrade misc eqiad to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288180 (https://phabricator.wikimedia.org/T131501) [12:32:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 601 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6151445 keys - replication_delay is 601 [12:32:57] (03CR) 10Ema: [C: 032 V: 032] Upgrade misc eqiad to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/288180 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [12:33:52] !log upgrading cp1045 to varnish 4 (T131501) [12:33:53] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:37] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:34:37] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:36:39] godog: kk will need to run on all targets pre-SWAT. Or it can be installed on tin *last* and then it shouldn't matter. [12:37:37] (03PS1) 10BBlack: tlsproxy: remove protocol 1.1 [puppet] - 10https://gerrit.wikimedia.org/r/288182 (https://phabricator.wikimedia.org/T107749) [12:37:55] kart_: no, it's not related to the MW side, so we are safe to proceed at that time [12:37:57] (03PS2) 10BBlack: tlsproxy: remove protocol 1.1 [puppet] - 10https://gerrit.wikimedia.org/r/288182 (https://phabricator.wikimedia.org/T107749) [12:39:23] !log upgrading cp1051 to varnish 4 (T131501) [12:39:24] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:39:33] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: remove protocol 1.1 [puppet] - 10https://gerrit.wikimedia.org/r/288182 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [12:39:59] I could use some op help to land four puppet patches for CI. None impacting prod and all being applied on the CI puppetmaster. Should be straightforward :D First of the serie being https://gerrit.wikimedia.org/r/#/c/286869/ [12:40:06] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:40:46] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:40:46] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:06] thcipriani|afk: ok thanks, I'll merge now [12:41:18] (03PS2) 10Filippo Giunchedi: scap: update to 3.2.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/287918 [12:41:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] scap: update to 3.2.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/287918 (owner: 10Filippo Giunchedi) [12:45:16] (03PS1) 10Elukey: Add partman receipe for new AQS hosts with SSDs. 
[puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) [12:45:47] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:56] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:16] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:27] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:47] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:54] job runner in trouble? [12:47:05] !log upgrading cp1058 to varnish 4 (T131501) [12:47:06] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:47:07] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:16] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:17] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:27] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:40] ah yes memory saturated in server-board [12:47:44] going to restart it [12:47:47] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:47] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:37] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 2 failures [12:49:58] (03CR) 10Hashar: "Jobs got migrated to Jessie instances and Trusty ones have been garbage collected." 
[puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [12:50:00] !log upgrading cp1061 to varnish 4 (T131501) [12:50:01] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501 [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:23] !log mw1147 powercycled due to unresponsiveness (not able to login as root) [12:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:48] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:50:48] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:50:56] PROBLEM - Disk space on mw1147 is CRITICAL: Timeout while attempting connection [12:50:56] PROBLEM - HHVM processes on mw1147 is CRITICAL: Timeout while attempting connection [12:51:15] (03PS1) 10Ema: Set misc as varnish4-only [puppet] - 10https://gerrit.wikimedia.org/r/288185 (https://phabricator.wikimedia.org/T131501) [12:51:36] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:51:37] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:52:07] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full [12:52:26] RECOVERY - DPKG on mw1147 is OK: All packages OK [12:52:30] (03CR) 10Ema: [C: 032 V: 032] Set misc as varnish4-only [puppet] - 10https://gerrit.wikimedia.org/r/288185 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [12:52:38] RECOVERY - Disk space on mw1147 is OK: DISK OK [12:52:38] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [12:52:38] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm [12:52:58] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:53:07] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:53:08] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [12:53:17] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [12:53:38] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2284875 (10faidon) Perhaps we should rename some of them to avoid confusion (e.g. "udpmxircecho" as "rcbot"?) and start being consistent in our naming? 
[12:53:46] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:53:46] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [12:53:47] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 68157 bytes in 3.700 second response time [12:53:47] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.413 second response time [12:55:21] 06Operations, 10Traffic: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2284895 (10ema) [12:55:23] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Wikidata Query Service REST endpoint returns truncated results - https://phabricator.wikimedia.org/T133490#2284894 (10ema) [12:55:26] 06Operations, 10Traffic, 13Patch-For-Review: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2284896 (10ema) [12:55:29] 06Operations, 10Traffic, 13Patch-For-Review: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2284893 (10ema) 05Open>03Resolved [12:56:06] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:56:42] 06Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2284900 (10BBlack) I'm out of good ideas for now. I've tried both of the obvious alternatives (1.1 w/o explicit Connection setting, and 1.1 w/o explicit Connection setti... [12:57:57] (03PS2) 10Elukey: Add partman receipe for new AQS hosts with SSDs. [puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) [13:00:33] (03PS2) 10Faidon Liambotis: base: add gdisk to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/287975 [13:00:56] (03CR) 10Faidon Liambotis: [C: 032] base: add gdisk to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/287975 (owner: 10Faidon Liambotis) [13:02:11] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 4 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10Yann) I had the same issue with https://commons.wikimedia.org/wiki/File:... [13:07:40] (03PS6) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) [13:09:11] godog: thcipriani|afk: fyi, we have scheduled to move cxserver and mobileapps with _joe_ @ 15: [13:09:16] 15:30 UTC [13:09:49] godog: i trust the new scap pkg will be available everywhere by then? 
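The Commons task quoted above fails with LocalFileLockError: "Could not acquire lock". MediaWiki's lock manager is its own code, but the general failure mode (try to take an exclusive lock, retry briefly, then give up with exactly that kind of error) can be sketched with POSIX advisory locks:

```
import errno
import fcntl
import time

def acquire_lock(path, timeout=5.0):
    # Try to take an exclusive advisory lock, retrying until `timeout`
    # seconds have passed -- roughly the failure mode behind a
    # "Could not acquire lock" error. Illustrative only.
    fh = open(path, "a")
    deadline = time.time() + timeout
    while True:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fh  # caller closes the handle to release the lock
        except OSError as e:
            if e.errno not in (errno.EACCES, errno.EAGAIN):
                raise
            if time.time() >= deadline:
                fh.close()
                raise RuntimeError("Could not acquire lock on %s" % path)
            time.sleep(0.1)
```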
[13:10:08] (03PS4) 10Hashar: contint: drop libcurl4-gnutls-dev [puppet] - 10https://gerrit.wikimedia.org/r/286837 (https://phabricator.wikimedia.org/T134378) [13:10:10] (03PS5) 10Hashar: contint: move package_builder setup to its own class [puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) [13:10:13] (03PS3) 10Hashar: contint: regroup PHP definitions in contint::packages::php [puppet] - 10https://gerrit.wikimedia.org/r/286879 [13:10:14] (03PS3) 10Hashar: contint: move File[/srv/localhost-worker] out of role [puppet] - 10https://gerrit.wikimedia.org/r/286869 [13:11:01] (03CR) 10Faidon Liambotis: [C: 032] contint: move File[/srv/localhost-worker] out of role [puppet] - 10https://gerrit.wikimedia.org/r/286869 (owner: 10Hashar) [13:11:19] (03CR) 10Faidon Liambotis: [C: 032] contint: move package_builder setup to its own class [puppet] - 10https://gerrit.wikimedia.org/r/286873 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [13:11:32] (03CR) 10Faidon Liambotis: [C: 032] contint: regroup PHP definitions in contint::packages::php [puppet] - 10https://gerrit.wikimedia.org/r/286879 (owner: 10Hashar) [13:11:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6144704 keys - replication_delay is 0 [13:12:03] (03CR) 10Faidon Liambotis: [C: 032] contint: drop libcurl4-gnutls-dev [puppet] - 10https://gerrit.wikimedia.org/r/286837 (https://phabricator.wikimedia.org/T134378) (owner: 10Hashar) [13:15:03] (03Draft1) 10Addshore: Dont log dewiki_diffstats to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 [13:15:14] (03PS2) 10Addshore: Don't log dewiki_diffstats to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) [13:16:06] (03PS7) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) [13:18:38] (03PS3) 10Addshore: Don't log dewiki_diffstats to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) [13:18:45] (03CR) 10Volans: [C: 032] MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) (owner: 10Volans) [13:18:59] 06Operations, 06Performance-Team, 05codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2284980 (10faidon) [13:19:23] 06Operations, 06Performance-Team, 05codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2225647 (10faidon) Adding #Performance-Team. @aaron, is this something that you could investigate further? 
[13:21:33] 06Operations: Puppet-manage redis.conf - https://phabricator.wikimedia.org/T134400#2285006 (10faidon) a:03Joe [13:24:58] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2285011 (10BBlack) One more thing to remember when upgrading nodes above: ``` chmod 644 /var/lib/varnish/*/*.vsm service ganglia-monitor restart ``` [13:29:02] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Gehel) [13:31:48] !log camus + puppet disabled on analytics1027 as prep step for kafka cluster migration to 0.9 [13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:00] (03PS1) 10Ottomata: Kafka 0.9 on kafka1012 [puppet] - 10https://gerrit.wikimedia.org/r/288189 (https://phabricator.wikimedia.org/T121562) [13:33:16] !log starting kafka 0.9 upgrade of analytics-eqiad cluster [13:33:23] 06Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2285067 (10faidon) Another idea (Ori's) is to use Kafka for log shipping. Kafka 0.9 (deployed at Wikimedia as we speak) has encryption/authentication/authorization features... [13:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 670 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6145623 keys - replication_delay is 670 [13:35:07] akosiaris, is it possible to try the new version of osm2pgsql on one of the non-masters, and if everything is good, resync it back? [13:35:38] although it is probably not worth it because gehel will install the new db on the new machines, so no point in wasting time [13:35:52] (03CR) 10Ottomata: [C: 032] Kafka 0.9 on kafka1012 [puppet] - 10https://gerrit.wikimedia.org/r/288189 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [13:35:54] yeah, sorry, spoke too soon, no rush on this [13:36:55] yurik: let me know if you want me to deploy this new osm2pgsql, it should be no trouble... [13:37:19] (03PS1) 10Ottomata: Kafka 0.9 on kafka1013 [puppet] - 10https://gerrit.wikimedia.org/r/288190 (https://phabricator.wikimedia.org/T121562) [13:37:30] gehel, lets not update maps-tests - no point in experimenting with it. Lets from the start use the new osm2pgsql on the new maps cluster [13:39:35] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10BBlack) Assuming there was no transient issue (which became cached) on the wdqs end of things, then t...
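faidon's comment above floats Kafka 0.9 as a log-shipping transport precisely because 0.9 adds encryption and authentication. For shape, a minimal producer sketch with the kafka-python client; broker address, port and topic are placeholders, not the production configuration:

```
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka1012.eqiad.wmnet:9093"],  # placeholder broker
    security_protocol="SSL",  # the encryption that Kafka 0.9 makes possible
    value_serializer=lambda rec: json.dumps(rec).encode("utf-8"),
)
producer.send("mediawiki.log", {"channel": "exception", "message": "..."})
producer.flush()
```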
[13:45:57] (03PS1) 10Jdrewniak: T128546 updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288192 (https://phabricator.wikimedia.org/T128546) [13:51:26] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285161 (10Gehel) p:05Triage>03High a:03Gehel [13:52:36] (03CR) 10Ottomata: [C: 032] Kafka 0.9 on kafka1013 [puppet] - 10https://gerrit.wikimedia.org/r/288190 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [13:53:31] (03PS1) 10Ottomata: Kafka 0.9 on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/288194 (https://phabricator.wikimedia.org/T121562) [13:54:15] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285184 (10Gehel) I can't reproduce the issue. Since it was random, I might just be lucky. Closing this anyway f... [13:54:26] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285185 (10Gehel) 05Open>03Resolved [13:55:02] (03PS1) 10Volans: Add the replication_role parameter [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288195 (https://phabricator.wikimedia.org/T133333) [13:55:31] (03PS1) 10Giuseppe Lavagetto: Initial debianization [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 [13:56:32] 06Operations, 10Traffic, 13Patch-For-Review: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2285204 (10ema) 05Open>03Resolved a:03ema varnishstatsd has been behaving properly for a while now, logging a message in case... [13:58:30] (03PS3) 10Elukey: Add partman receipe for new AQS hosts with SSDs.
[puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) [13:58:32] (03CR) 10Volans: "To be used for things like https://gerrit.wikimedia.org/r/#/c/287394/ and future customizations without being too specific or adding too m" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288195 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [13:58:51] (03PS2) 10Giuseppe Lavagetto: Initial debianization [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) [14:02:56] (03CR) 10Ottomata: [C: 032] Kafka 0.9 on kafka1014 [puppet] - 10https://gerrit.wikimedia.org/r/288194 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:05:51] (03PS1) 10Ottomata: Kafka 0.9 on kafka1022 [puppet] - 10https://gerrit.wikimedia.org/r/288200 (https://phabricator.wikimedia.org/T121562) [14:05:53] (03PS1) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [14:07:08] (03CR) 10jenkins-bot: [V: 04-1] Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [14:09:35] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285279 (10Gehel) 05Resolved>03Open [14:09:51] (03PS2) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [14:11:21] (03CR) 10jenkins-bot: [V: 04-1] Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 (owner: 10Muehlenhoff) [14:12:02] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: WDQS empty response - transfer closed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285284 (10Gehel) Seems the issue (or a similar one) is still happening, I'm trying to reproduce... [14:12:09] (03PS2) 10Ottomata: Kafka 0.9 on kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288200 (https://phabricator.wikimedia.org/T121562) [14:12:27] (03CR) 10Ottomata: [C: 032 V: 032] Kafka 0.9 on kafka1018 [puppet] - 10https://gerrit.wikimedia.org/r/288200 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:17:27] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [14:20:46] (03PS1) 10Ema: varnishprocessor.py: stop spamming logs [puppet] - 10https://gerrit.wikimedia.org/r/288204 (https://phabricator.wikimedia.org/T132474) [14:21:32] (03PS1) 10Volans: MariaDB: Change default ssl mode to puppet-cert [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) [14:21:41] 06Operations, 07Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#2285307 (10Krenair) Yes, and also merge ircecho with tcpircbot. I've had patches open to work on that for over 6 months (one of which is still WIP waiting for the dependency).
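ema's "varnishprocessor.py: stop spamming logs" patch above targets T132474, where overlapping transactions re-enter flush_stats() while a flush is still in flight and flood the log with 'Flush' messages. The usual fix is a non-blocking guard so that concurrent callers skip instead of piling up; a generic sketch of that idea, not the actual varnishprocessor code:

```
import threading

class StatsFlusher:
    # Serialize flushes: callers that arrive while a flush is already
    # in flight return immediately instead of queueing a duplicate.
    def __init__(self, send_to_statsd):
        self._lock = threading.Lock()
        self._send = send_to_statsd

    def flush(self, stats):
        if not self._lock.acquire(False):  # non-blocking attempt
            return False  # someone else is flushing; skip quietly
        try:
            self._send(stats)
            return True
        finally:
            self._lock.release()
```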
[14:23:01] (03CR) 10BBlack: [C: 031] varnishprocessor.py: stop spamming logs [puppet] - 10https://gerrit.wikimedia.org/r/288204 (https://phabricator.wikimedia.org/T132474) (owner: 10Ema) [14:23:55] (03PS1) 10Ottomata: Kafka 0.9 on kafka1020 [puppet] - 10https://gerrit.wikimedia.org/r/288206 (https://phabricator.wikimedia.org/T121562) [14:24:58] 06Operations, 10Traffic, 13Patch-For-Review: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2285323 (10ema) If flush_stats is not 'fast enough' sending stats to statsd, it happens that other transactions call it again, thus leading to the 'Flush' message flood. We s... [14:25:13] (03CR) 10Ottomata: [C: 032 V: 032] Kafka 0.9 on kafka1020 [puppet] - 10https://gerrit.wikimedia.org/r/288206 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:25:15] (03CR) 10Ema: [C: 032 V: 032] varnishprocessor.py: stop spamming logs [puppet] - 10https://gerrit.wikimedia.org/r/288204 (https://phabricator.wikimedia.org/T132474) (owner: 10Ema) [14:25:21] (03CR) 10Muehlenhoff: "Looks good to me (but didn't try to build it)" (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) (owner: 10Giuseppe Lavagetto) [14:25:48] (03PS2) 10Ema: varnishprocessor.py: stop spamming logs [puppet] - 10https://gerrit.wikimedia.org/r/288204 (https://phabricator.wikimedia.org/T132474) [14:25:55] (03CR) 10Ema: [V: 032] varnishprocessor.py: stop spamming logs [puppet] - 10https://gerrit.wikimedia.org/r/288204 (https://phabricator.wikimedia.org/T132474) (owner: 10Ema) [14:27:33] (03PS3) 10Muehlenhoff: Disable unprivileged bpf on Linux >= 4.4 [puppet] - 10https://gerrit.wikimedia.org/r/288201 [14:28:34] (03PS2) 10Volans: MariaDB: Change mariadb::core parameters [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) [14:31:46] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [14:32:14] ---^ this one was probably due to the kafka upgrade [14:36:33] !log wiping caches on cache_misc... [14:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:54] (03PS4) 10Elukey: Add partman receipe for new AQS hosts with SSDs. [puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) [14:36:58] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2285353 (10Cmjohnson) @gehel let's schedule a time to fix this [14:37:26] cmjohnson1: yes, what time is good for you? [14:37:43] want to do today? [14:38:02] (03PS3) 10Giuseppe Lavagetto: nagios_common: Add command for using service_checker [puppet] - 10https://gerrit.wikimedia.org/r/287907 (https://phabricator.wikimedia.org/T134551) [14:38:28] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:38:30] cmjohnson1: Honnestly, with all the problems I've already had with WDQS, I don't want to touch it at all... [14:38:50] okay...it can wait..i am not in a hurry just want to make sure you get what you paid for ;-) [14:39:05] can do anytime [14:39:47] (03PS1) 10Andrew Bogott: bootstrap-vz: more special-case handling for labtest [puppet] - 10https://gerrit.wikimedia.org/r/288207 [14:39:56] cmjohnson1: but it has to be done, so let's get it out of the way. 
I'd be happier to do it tomorrow and send a warning before. We'll need to run on a single server for a few days during data import and I've seen how things can go wrong... [14:40:01] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios_common: Add command for using service_checker [puppet] - 10https://gerrit.wikimedia.org/r/287907 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [14:40:05] cmjohnson1: would tomorrow be good for you? [14:40:50] @gehel sure....1400UTC ok with you? [14:41:09] (03PS2) 10Andrew Bogott: bootstrap-vz: more special-case handling for labtest [puppet] - 10https://gerrit.wikimedia.org/r/288207 [14:41:35] cmjohnson1: great! Thanks for making me move on this! [14:42:24] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#2285373 (10Gehel) Replacement is scheduled for 1400 UTC, Thursday May 12. [14:43:49] (03PS3) 10Andrew Bogott: bootstrap-vz: more special-case handling for labtest [puppet] - 10https://gerrit.wikimedia.org/r/288207 [14:45:47] PROBLEM - Varnishkafka log producer on cp1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:45:47] PROBLEM - Varnishkafka log producer on cp1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:45:48] PROBLEM - Varnishkafka log producer on cp4004 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:08] PROBLEM - Varnishkafka log producer on cp2018 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:08] PROBLEM - Varnishkafka log producer on cp3007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:32] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: more special-case handling for labtest [puppet] - 10https://gerrit.wikimedia.org/r/288207 (owner: 10Andrew Bogott) [14:46:37] PROBLEM - Varnishkafka log producer on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:46] PROBLEM - Varnishkafka log producer on cp1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:57] PROBLEM - Varnishkafka log producer on cp2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:57] PROBLEM - Varnishkafka log producer on cp4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:46:57] PROBLEM - Varnishkafka log producer on cp3008 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:47:36] PROBLEM - Varnishkafka log producer on cp1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:47:46] RECOVERY - Varnishkafka log producer on cp1045 is OK: PROCS OK: 1 process with command name varnishkafka [14:47:46] RECOVERY - Varnishkafka log producer on cp1051 is OK: PROCS OK: 1 process with command name varnishkafka [14:47:47] RECOVERY - Varnishkafka log producer on cp4004 is OK: PROCS OK: 1 process with command name varnishkafka [14:47:48] bblack it's you? 
^^^ [14:47:56] volans: yes it's all good [14:47:58] well, not intentionally, but yes [14:48:08] RECOVERY - Varnishkafka log producer on cp2018 is OK: PROCS OK: 1 process with command name varnishkafka [14:48:08] RECOVERY - Varnishkafka log producer on cp3007 is OK: PROCS OK: 1 process with command name varnishkafka [14:48:09] good :) [14:48:14] apparently multiple of our stats/logging daemons don't survive varnishd restarts :P [14:48:38] RECOVERY - Varnishkafka log producer on cp3009 is OK: PROCS OK: 1 process with command name varnishkafka [14:48:47] RECOVERY - Varnishkafka log producer on cp1058 is OK: PROCS OK: 1 process with command name varnishkafka [14:48:53] there are also a lot of kafka related messages, we just upgraded all the brokers to 0.9 [14:48:57] RECOVERY - Varnishkafka log producer on cp2006 is OK: PROCS OK: 1 process with command name varnishkafka [14:49:06] RECOVERY - Varnishkafka log producer on cp4001 is OK: PROCS OK: 1 process with command name varnishkafka [14:49:06] RECOVERY - Varnishkafka log producer on cp3008 is OK: PROCS OK: 1 process with command name varnishkafka [14:49:37] RECOVERY - Varnishkafka log producer on cp1061 is OK: PROCS OK: 1 process with command name varnishkafka [14:57:32] (03PS1) 10Giuseppe Lavagetto: nagios_common: rename resource [puppet] - 10https://gerrit.wikimedia.org/r/288208 [14:58:02] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285393 (10Ottomata) All analytics brokers are now upgraded to confluent 0.9! Tomorrow we will switch off 0.8 inter broker... [14:59:07] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [14:59:39] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios_common: rename resource [puppet] - 10https://gerrit.wikimedia.org/r/288208 (owner: 10Giuseppe Lavagetto) [14:59:58] (03PS5) 10Elukey: Add partman receipe for new AQS hosts with SSDs. [puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) [15:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T1500). [15:00:05] Glaisher jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:33] I can SWAT today. [15:01:02] Glaisher: jan_drewniak ping me if/when you're around. [15:02:11] thcipriani: o/ [15:02:51] thcipriani: here [15:03:11] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288192 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:03:26] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:47] mobrovac: I'm around for scap3. [15:04:51] (03Merged) 10jenkins-bot: T128546 updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288192 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:07:11] (03CR) 10Elukey: [C: 032] "Trying it out with an install." 
[puppet] - 10https://gerrit.wikimedia.org/r/288184 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey) [15:10:21] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: T128546 updating portal stats [[gerrit:288192]] (duration: 00m 43s) [15:10:22] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [15:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:05] !log thcipriani@tin Synchronized portals: SWAT: T128546 updating portal stats [[gerrit:288192]] (duration: 00m 33s) [15:11:09] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [15:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:32] ^ jan_drewniak check please [15:12:32] thcipriani: looks good :) [15:13:31] jan_drewniak: awesome! I ran the sync-portals script manually, basically, I have an update for you folks on that. We've moved to using: scap sync-dir, rather than just sync-dir. All scap commands are now subcommands. [15:13:51] I can make a patch for that repo if it's helpful. [15:14:37] (03PS3) 10Volans: MariaDB: Change mariadb::core parameters [puppet] - 10https://gerrit.wikimedia.org/r/288205 (https://phabricator.wikimedia.org/T111654) [15:16:00] thcipriani: thanks for that! yeah, my knowledge of what that script does is minimal :P If it's an easy fix then a patch would be appreciated :) https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/portals [15:16:31] jan_drewniak: yup, it's a trivial change. I'll make a patch here shortly :) [15:20:20] mobrovac: ping :) [15:20:34] kart_: pong [15:20:42] !log thcipriani@tin Synchronized php-1.28.0-wmf.1/extensions/Translate/tag/PageTranslationHooks.php: SWAT: On documentation unit deletion, dont attempt to re-render translation page [[gerrit:288165]] (duration: 00m 27s) [15:20:48] kart_: we are waiting on _joe_ and bearND to proceed [15:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:50] ^ Glaisher check please [15:20:57] looking [15:21:00] mobrovac: oh. OK! [15:21:39] thcipriani: tested, works [15:21:41] thanks [15:21:54] Glaisher: thanks for checking [15:22:22] !log first SWAT with scap subcommands (3.2.0-1) complete [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:32] thcipriani: \o/ [15:22:43] :D [15:25:25] mobrovac: I'm here [15:26:43] thcipriani: bd [15:27:39] greg-g: bd? (my acronym-fu is weak) [15:27:54] two thumbs up [15:28:03] ohhhhh [15:28:06] :) [15:28:08] thcipriani: nice! bd :D [15:28:43] thanks :) [15:28:45] it's confusing in our world due to the 808 variant [15:29:12] Imagine bd808 deploys. 808 thumbs up. [15:29:39] that would be 1616 thumbs up actually [15:29:44] :P [15:29:44] heh [15:30:04] * bd808 approves of this meme [15:30:41] _joe_: ping when you are available to proceed with the switch to scap [15:33:02] <_joe_> mobrovac: I'm here [15:33:25] <_joe_> so what's the procedure? It's a first for me :P [15:33:50] <_joe_> also, do we want to start with cxserver or mobileapps? 
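thcipriani's note above is the whole migration story for deploy scripts: with scap 3.2.0 the standalone binaries became subcommands, so sync-dir turns into scap sync-dir. The "trivial change" to a wrapper script amounts to something like this (the path and message are the ones from this SWAT; the script internals are an assumption):

```
import subprocess

# Old style (pre-3.2): subprocess.check_call(["sync-dir", path, msg])
# New style (scap >= 3.2.0): every command is a subcommand of `scap`.
path = "portals/prod/wikipedia.org/assets"
msg = "SWAT: updating portal stats"
subprocess.check_call(["scap", "sync-dir", path, msg])
```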
[15:33:56] (03PS1) 10BBlack: tlsproxy: use http/1.1 only in conjunction with varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/288213 (https://phabricator.wikimedia.org/T107749) [15:34:45] _joe_: i think the best way would be: disable puppet on scb*, merge the two changes, run puppet on tin, issue an empty deploy from there to get the scap stuff set up on tin, and then enable puppet back and do a real deploy [15:34:58] <_joe_> ok [15:35:01] <_joe_> on it [15:35:01] _joe_: we can do them together as far as ops/puppet is concerned [15:35:07] <_joe_> cool [15:35:55] <_joe_> mobrovac: isn't cxserver still on sca though? [15:35:59] kart_: bearND: please stand by [15:36:04] _joe_: no, just apertium is [15:36:16] <_joe_> ok [15:36:22] what will I have to do? [15:36:39] the deploy from tin :) [15:36:43] ok [15:36:46] but don't log in there just yet [15:36:59] I'm around. [15:37:04] (03PS2) 10Giuseppe Lavagetto: Deploy mobileapps using scap3, 2nd try [puppet] - 10https://gerrit.wikimedia.org/r/287112 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [15:37:14] mdholloway: ^ fyi [15:37:59] kart_: same for you, if you are logged into tin, log out [15:38:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/287112 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [15:38:37] thcipriani: i already have a feature req for the shiny new scap - bash command completion :P [15:38:51] * mdholloway is lurking [15:38:54] <_joe_> argh cxserver change needs a manual rebase [15:39:07] yup, _joe_, changing the same line in data.yaml [15:39:12] <_joe_> wait on it [15:39:56] <_joe_> gerrit is very slow today [15:40:03] mobrovac: indeed, my normal rabid mashing of tab proved unfruitful during morning SWAT :) [15:40:11] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2285479 (10GWicke) [15:40:35] 06Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2285481 (10bd808) >>! In T126989#2285067, @faidon wrote: > Another idea (Ori's) is to use Kafka for log shipping. Kafka 0.9 (deployed at Wikimedia as we speak) has encrypti... [15:40:38] 06Operations, 10ops-esams, 10hardware-requests: replace bast3001 with newer hardware - https://phabricator.wikimedia.org/T131562#2285483 (10RobH) a:03mark We don't have any tracking for spare servers in esams, I'm not sure if there are any. However, there is a good chance we aren't tracking them due to th... [15:40:44] (03CR) 10Ema: [C: 031] "Let's try this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/288213 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [15:41:47] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6139766 keys - replication_delay is 0 [15:41:55] (03PS6) 10Giuseppe Lavagetto: cxserver: scap3 migration [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [15:42:02] <_joe_> bblack: don't merge now please, wait 1 sec [15:42:34] (03PS1) 10Gehel: Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [15:42:37] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] cxserver: scap3 migration [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [15:42:42] _joe_: ok [15:42:59] <_joe_> mobrovac, bearND kart_ now merging and running puppet on tin [15:43:04] <_joe_> I'll tell you when I'm done [15:43:27] _joe_: puppet already disabled on scb* ? [15:43:43] <_joe_> mobrovac: yes [15:43:45] kk [15:43:47] <_joe_> bblack: you're GTG [15:44:48] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2285490 (10GWicke) [15:45:08] _joe_: ok [15:45:10] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2285492 (10cscott) Maybe we should be doing this on a labs machine first so I can log in and take a look at things. I don't have root on the production machines, so poking around is hard. It *seem... [15:45:17] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285494 (10Nuria) [15:45:46] _joe_: thanks! [15:45:48] <_joe_> puppet is painfully slow on tin, sigh [15:45:50] (03PS2) 10BBlack: tlsproxy: use http/1.1 only in conjunction with varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/288213 (https://phabricator.wikimedia.org/T107749) [15:46:09] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: use http/1.1 only in conjunction with varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/288213 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [15:46:32] even puppet-merge has been slow today I think I've noticed [15:46:34] something's odd [15:46:37] <_joe_> puppet did run on tin [15:46:38] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2265192 (10Nuria) @ottomatta: can this be closed? there is an open question from @MoritzMuehlenhoff [15:46:41] <_joe_> bblack: it's gerrit [15:46:48] <_joe_> bblack: varnish 4 conversion? 
:P [15:47:05] gerrit doesn't go through varnish :) [15:47:11] <_joe_> mobrovac, you can go on with the empty deploy on tin, whatever that means [15:47:17] it's java though, maybe it needs a kick in the head [15:47:19] <_joe_> bblack: just spreading FUD [15:47:49] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2285527 (10Nuria) 05Open>03Resolved [15:47:56] bearND: please log in to tin, go to the mcs deploy directory and issue "scap deploy" [15:47:59] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2265192 (10Nuria) 05Resolved>03Open [15:48:07] mobrovac: k [15:48:14] bearND: the deploy will fail, don't worry about it [15:48:20] <_joe_> kart_: you should do the same with cxserver I guess [15:48:22] <_joe_> mobrovac: lol [15:48:24] <_joe_> :P [15:48:31] _joe_: OK [15:48:35] yes, kart_, do the same [15:48:45] i'm watching you guys muahaha [15:48:53] mobrovac: yes, got deploy failed: [Errno 2] No such file or directory: '/srv/deployment/mobileapps/.git/config-files' [15:49:10] bearND: no, need to go into the deploy repo [15:49:29] bearND: /srv/deployment/mobileapps/deploy [15:49:33] mobrovac: oops, right [15:49:38] mobrovac: 15:49:21 1 targets had deploy errors [15:49:45] 06Operations, 10ops-esams, 06DC-Ops, 10Traffic, 10hardware-requests: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#2285533 (10RobH) [15:49:56] kart_: for you that's /srv/deployment/cxserver/deploy [15:50:07] kk bearND [15:50:20] yep. issuing. [15:50:48] <_joe_> ok when you're done I'll reenable puppet in codfw [15:50:58] ok, we're good [15:51:03] _joe_: go ahead [15:51:04] mobrovac: Stage 'fetch' failed on group 'canary'. Perform rollback? [y]: [15:51:13] kart_: type N [15:51:19] OK [15:51:27] k [15:51:35] 15:51:22 Finished Deploy: cxserver/deploy (duration: 00m 45s) [15:51:38] mobrovac: ^ [15:51:44] yup, i can see it too [15:51:49] cool. [15:51:57] big brother :) [15:51:58] <_joe_> forcing a puppet run on scb2001 [15:52:15] 06Operations, 10ops-esams, 10hardware-requests: replace bast3001 with newer hardware - https://phabricator.wikimedia.org/T131562#2285537 (10RobH) 05Open>03declined I discussed this in IRC with @faidon (Who is @mark's delegate this week.) Since bast3001 isn't in error, we won't be replacing it. If it do... [15:52:57] <_joe_> mobrovac: puppet ran like a charm [15:53:02] yay [15:53:04] <_joe_> should I run it everywhere? [15:53:11] _joe_: lemme check something there [15:53:12] just a sec [15:53:14] <_joe_> so we can test a real deploy? [15:53:15] <_joe_> ok [15:54:07] kk, dirs chowned, _joe_ you can force it everywhere [15:55:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "inline comments. For posterity's sake, the actual action was to move /var/lib/postgresql to /srv/postgresql." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [15:55:47] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:10] <_joe_> mobrovac: puppet ran everywhere [15:56:16] <_joe_> we can test a deploy I guess [15:56:19] yup [15:56:33] kart_: bearND: you can go ahead and re-issue the same command on tin [15:57:18] jynus: I added you to the multi-DC meeting since you were missing from the invite. 
Just ignore it if you are busy today though, since it's last-minute. [15:57:34] when is that? [15:57:34] mobrovac: OK! [15:57:41] (03CR) 10Gehel: "Yes, that seems much cleaner. I'm actually wondering if we should let that directory be configureable or if we should just hardcode it. Do" [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [15:58:06] k, doing the canary right now (scb2001) [15:58:25] jynus: same time as the last one (which is like a few min from now) ;) [15:58:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 638 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6140515 keys - replication_delay is 638 [15:58:30] mobrovac: was successful. should I continue? [15:58:36] bearND: yes [15:58:44] kart_ already did i see :) [15:58:49] mobrovac: finished. Yep. [15:58:51] AaronSchulz, thanks, I was really interested on attending [16:00:04] hoo frimelle: Dear anthropoid, the time has come. Please deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T1600). [16:00:10] what was the command to look at the logs again? [16:00:58] _joe_: kart_: bearND: thcipriani: \o/ we can declare victory! [16:01:05] \o/ [16:01:05] mobrovac: cool. [16:01:07] \o/ [16:01:09] \0/ [16:01:13] <_joe_> ok are we done? [16:01:16] yup [16:01:17] Thanks mobrovac _joe_ [16:01:35] <_joe_> cool [16:01:35] bearND: https://wikitech.wikimedia.org/wiki/Services/Deployment#Deployment_Debugging [16:01:56] bearND: you need to add "scap" in front of it though [16:02:05] mobrovac: thank you for the expert rollout :) [16:02:05] * mobrovac will update the docs as soon as he gets 5 mins free [16:02:16] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:17] mobileapps deploy finished [16:02:48] kart_: bearND: so, to be clear, from now on, you use "scap deploy" for deploying, no more "git deploy" [16:03:00] mobrovac: scap deploy-log gives a deprecation warning and fails with permission denied [16:03:19] (03CR) 10Alexandros Kosiaris: "yeah, the default /var/lib/postgresql/9.4/main. The datadir parameter came into being by the need to install postgresql into a separate pa" [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [16:03:23] mobrovac: on srv/deployment/mobileapps/deploy-cache/revs/b8c396aea43d81cf12b6eae6a7266bec189ae5db/scap/log [16:03:42] bearND: euh, are you running that on tin from /srv/deployment/mobileapps/deploy ? [16:03:58] mobrovac: scb2001 [16:04:07] our canary [16:04:09] no bearND, deploy-log works on tin only [16:04:24] (03CR) 10Gehel: "Damn! Standards are hard... there are so many to choose from..." 
[puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [16:04:25] ah, good to know [16:04:41] bearND: you can use some expressions that will narrow the output of scap deploy-log to a single machine: https://doc.wikimedia.org/mw-tools-scap/scap3/deploy_commands.html#scap-deploy-log [16:05:01] (03PS2) 10Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [16:05:01] er...https://doc.wikimedia.org/mw-tools-scap/scap3/deploy_commands.html#examples better link [16:05:44] bearND: mdholloway thanks for spending a non-insignificant amount of time with me at your offsite working on the scap migration, I appreciate it :) [16:05:59] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2285553 (10mobrovac) >>! In T84723#2285492, @cscott wrote: > It *seems* that the "Error: Module did not register" is actually coming from node, and maybe it's complaining that the /etc/ocg/mw-ocg-se... [16:06:18] mobrovac: ack. Thanks a lot. [16:06:36] thcipriani: likewise! [16:07:27] !log wiping cache_misc caches again... [16:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:06] thcipriani: mobrovac: _joe_: thank you! [16:14:15] !log upgrade etherpad software to 1.6.0-1 and evaluate stability. packages NOT yet uploaded on apt.w.o [16:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:59] 06Operations, 06Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2285579 (10RobH) 05Open>03Resolved a:03RobH Systems arrived and have been racked. Resolving this hardware request. [16:19:01] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, 13Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2285583 (10RobH) [16:36:33] (03CR) 10Dzahn: "normal system::role's would be in a class that is also a role::something. .." [puppet] - 10https://gerrit.wikimedia.org/r/288150 (owner: 10Hoo man) [16:37:50] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider cassandra for session storage (and SSL) - https://phabricator.wikimedia.org/T134811#2285697 (10Joe) [16:38:18] (03PS2) 10Dzahn: Add a role description to snapshot::cron::wikidatadumps::json [puppet] - 10https://gerrit.wikimedia.org/r/288150 (owner: 10Hoo man) [16:39:37] (03CR) 10Dzahn: [C: 032] "ok, anyways, the whole role/snapshot should be split and moved somehow like here Change-Id: I657d4de5731ea7d95" [puppet] - 10https://gerrit.wikimedia.org/r/288150 (owner: 10Hoo man) [16:40:06] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2285702 (10JanZerebecki) > Puppet runs on a per-host basis, That does not contradict the concept of roles/services. There are more things that are only available for the whole host, but a... [16:41:36] (03PS1) 10Catrope: Enable Flow beta feature on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288224 (https://phabricator.wikimedia.org/T134898) [16:43:46] !log Reduced number of executors on Trusty instances from 3 to 2. Memory get exhausted causing the tmpfs to drop files and thus MW jobs to fail randomly. [16:43:56] hoo: ebernhardson : some jobs failed randomly. 
Just +2 again as needed :( [16:44:10] and morebot died [16:44:14] hashar: Ok, thanks for the heads up [16:45:38] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2285727 (10Dzahn) What Marko said, i googled around a bit and this sounds like the binary has to be recompiled for the newer node version. ack. [16:45:51] (03CR) 10Volans: [C: 032] Add the replication_role parameter [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/288195 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [16:46:01] CI doesn't scale well right now when bunch of mw/core and wikibases changes are tested [16:46:07] that consumes too much memory [16:46:13] reducing # of jobs running in parallel would help [16:46:32] until we get job to run on a dedicated instance which I am actively working on ;-D [16:47:19] Sorry [16:47:23] doing a lot of backports [16:47:24] not your fault [16:47:32] entirely the fault of CI really [16:47:47] sending a myriad of patches is a perfectly legitimate use :-] [16:47:50] I am off! [16:50:34] (03CR) 1020after4: [C: 031] "We definitely need to separate scap from trebuchet stuff" [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk) [16:52:59] twentyafterfour, reckon we can get that in puppet swat? maybe after a puppet compiler run on tin/mira? [16:56:11] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2285807 (10elukey) mc1009 has been running today with growth factor 1.15 and I can see (finally!) 0 evictions up to now. Memory allocation is back to its old b... [16:59:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 685 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6138487 keys - replication_delay is 685 [17:00:17] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288228 (https://phabricator.wikimedia.org/T134273) [17:00:19] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on eowiki, orwiki and napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288229 (https://phabricator.wikimedia.org/T134273) [17:00:46] (03PS1) 10Elukey: Add new memcached features/settings to mc1009 as part of perf experiment. [puppet] - 10https://gerrit.wikimedia.org/r/288230 (https://phabricator.wikimedia.org/T129963) [17:03:02] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2285851 (10Dzahn) What Jan says makes sense to me. 
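elukey's T129963 update above credits the smaller slab growth factor (1.15 instead of memcached's default 1.25) with getting mc1009 to zero evictions. The factor sets the geometric progression of slab chunk sizes; a quick illustration of why a smaller factor fits items more tightly (the 96-byte base chunk is illustrative, since the real base depends on the -n setting and item overhead):

```
def slab_sizes(factor, base=96, max_size=1024 * 1024):
    # Chunk sizes of successive slab classes: each class is `factor`
    # times larger than the previous one, up to the 1MB item limit.
    sizes, size = [], float(base)
    while size <= max_size:
        sizes.append(int(size))
        size *= factor
    return sizes

for f in (1.25, 1.15):
    classes = slab_sizes(f)
    print("factor %.2f -> %d slab classes, first few: %s" %
          (f, len(classes), classes[:6]))
# More, finer-grained classes with 1.15 mean items waste less of the
# chunk they land in, so memory goes further before evictions start.
```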
[17:03:15] (03CR) 10Lucie Kaffee: [C: 031] Enable the ArticlePlaceholder on eowiki, orwiki and napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288229 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [17:03:33] (03CR) 10Lucie Kaffee: [C: 031] Enable the ArticlePlaceholder on htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288228 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [17:04:26] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: puppet fail [17:04:42] (03Abandoned) 10Dzahn: interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 (owner: 10Dzahn) [17:04:47] (03Abandoned) 10Dzahn: interface: move aggregate_member to own file [puppet] - 10https://gerrit.wikimedia.org/r/284084 (owner: 10Dzahn) [17:06:16] (03Abandoned) 10Dzahn: role/dns: move to module role, rename ::dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/271735 (owner: 10Dzahn) [17:06:41] (03Abandoned) 10Dzahn: quarry: use one file per class, autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260187 (owner: 10Dzahn) [17:07:04] (03Abandoned) 10Dzahn: ldap: rename role classes [puppet] - 10https://gerrit.wikimedia.org/r/279682 (owner: 10Dzahn) [17:10:58] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [17:13:06] 06Operations, 10ops-eqiad: Install 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285911 (10Cmjohnson) [17:13:10] (03PS1) 10BBlack: cache_misc: remove all CL-sensitive stream/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/288231 [17:13:36] 06Operations, 10ops-eqiad: Install 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285936 (10Cmjohnson) [17:13:38] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2285935 (10Cmjohnson) [17:16:05] 06Operations, 10ops-eqiad: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285939 (10Cmjohnson) a:03Cmjohnson [17:17:06] (03CR) 10Ema: [C: 031] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/288231 (owner: 10BBlack) [17:26:57] (03CR) 10BBlack: [C: 032] cache_misc: remove all CL-sensitive stream/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/288231 (owner: 10BBlack) [17:32:22] Krenair: seems like a good idea [17:33:41] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:34:24] !log wiping cache_misc caches again... [17:36:26] 06Operations, 07Graphite, 13Patch-For-Review: put additional graphite machines in service - https://phabricator.wikimedia.org/T134889#2286020 (10fgiunchedi) this will be happening tomorrow at 10.00 UTC as per https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160512T1000 as follows: * [ahead o... 
[17:39:01] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:24] otrs down for me [17:50:30] (from Italy, via Amsterdam) [17:50:37] oh up again, nvm [17:51:15] !log hoo@tin Synchronized php-1.27.0-wmf.23/extensions/Wikidata: Update Wikibase (Property ordering related backports) (duration: 02m 11s) [17:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:31] !log hoo@tin Synchronized php-1.28.0-wmf.1/extensions/Wikidata: Update Wikibase (Property ordering related backports) (duration: 01m 52s) [17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:01] !log hoo@tin Started scap: Update ArticlePlaceholder to master for initial Wikipedia deployment. [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:04] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1307 - https://phabricator.wikimedia.org/T134309#2286140 (10Cmjohnson) [17:59:34] !log uploaded new linux package for jessie-wikimedia to carbon (based on 4.4.9) [17:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6137426 keys - replication_delay is 0 [18:04:50] (03PS1) 10BBlack: Partial Revert: cache_misc: remove all CL-sensitive stream/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/288242 [18:06:50] (03CR) 10BBlack: [C: 032] Partial Revert: cache_misc: remove all CL-sensitive stream/pass logic [puppet] - 10https://gerrit.wikimedia.org/r/288242 (owner: 10BBlack) [18:11:45] 06Operations, 10DBA, 10MediaWiki-Database, 07Performance: Implement GTID replication on MariaDB 10 servers - https://phabricator.wikimedia.org/T133385#2286190 (10jcrespo) [18:14:27] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2286207 (10Cmjohnson) Figured out the problem. I updated elastic1047 and it installed w/out a problem. I need to fix the other 15. Running a System ROM Version 1.32 (or later) dated March 5, 2015 (03-05... [18:16:42] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2286211 (10Dzahn) After this ticket is done we should really make one ultimate master list that lists each and every step needed for creating a... [18:18:35] bblack: regex good enough? https://gerrit.wikimedia.org/r/#/c/287032/6/modules/letsencrypt/files/acme-setup [18:18:58] moritzm: aaaand 4.4.10 was just released :) [18:19:12] (03PS1) 10BBlack: X-Cache: support moving through hit before miss [puppet] - 10https://gerrit.wikimedia.org/r/288244 [18:21:46] mutante: sorry, been busy. [18:22:09] mutante: well from old gerrit comments: '^[A-Za-z0-9][-A-Za-z0-9_]*$' [18:22:22] it doesn't explain as well in a help message though [18:22:34] but it is ugly to let them start a filename with - or _ [18:22:39] maybe not ugly enough to matter [18:23:49] bblack: ok, thanks dont worry about it now.
there is more important stuff [18:24:02] mutante: only other thing is, the way the regex is written now is invalid, because it thinks the '-' is part of a character range [18:24:21] if you just rewrite it as: '^[-a-z0-9_]+$' it will work [18:24:41] and I guess put back in uppercase too [18:24:43] right, ok, thanks, will fix it [18:24:59] so '^[-a-zA-Z0-9_]+$' [18:25:27] ok, yea, uppercase was the part i wasnt sure if we wanted it or not [18:27:12] should probably be legal, I think, but honestly I have no idea [18:27:37] I'd say as long as we're just validating sanity, uppercase is fine [18:27:49] if we were going to have the script normalize the id, I'd say force it to downcase. [18:28:22] but I donno, normalization of the id seems overkill. if they supply a bad one just fail, and be slightly more lax about what's "bad" and let puppetization handle normalizing to something that meets the rules [18:28:35] bikeshed-hell [18:28:37] (03PS7) 10Dzahn: acme-setup: only accept '^[-a-zA-Z0-9_]+$' as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) [18:29:03] heh, yea. but that made sense [18:36:09] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2286251 (10Dzahn) I recently saw that we have now "blogs" as apps in phabricator (https://phabricator.wikimedia.org/phame/ , T104633 ,... I don't know much about i... [18:47:20] !log hoo@tin Finished scap: Update ArticlePlaceholder to master for initial Wikipedia deployment. (duration: 51m 18s) [18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:58] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288228 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [18:53:36] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288228 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [18:55:30] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on htwiki - T134273 (duration: 00m 46s) [18:55:31] T134273: Deploy ArticlePlaceholder to eowiki, htwiki, orwiki and napwiki - https://phabricator.wikimedia.org/T134273 [18:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:51] https://ht.wikipedia.org/wiki/Espesyal:AboutTopic?entityid=Q2013 YAY! [18:58:00] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [18:58:21] !log disregard ulsfo cp system icinga spam, onsite work for thermal paste per T134831 [18:58:22] T134831: ulsfo planned maintenance on 2016-05-11 - https://phabricator.wikimedia.org/T134831 [18:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T1900). Please do the needful.
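The character-range pitfall bblack describes above is worth a concrete illustration: inside a bracket expression, an unescaped '-' between two characters is parsed as a range, so it has to come first (or last) to be taken literally. A small bash check using the final expression from the acme-setup patch; the sample ID is made up:

```
# '-' placed first inside [...] is a literal hyphen; placed between two
# characters it would be parsed as a (possibly invalid) range.
cert_id='le-test_01'
if [[ "$cert_id" =~ ^[-a-zA-Z0-9_]+$ ]]; then
    echo "valid cert ID: $cert_id"
else
    echo "invalid cert ID: $cert_id" >&2
fi
```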
[19:04:16] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on eowiki, orwiki and napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288229 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [19:04:56] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on eowiki, orwiki and napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288229 (https://phabricator.wikimedia.org/T134273) (owner: 10Hoo man) [19:08:36] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on eowiki, orwiki and napwiki (duration: 00m 29s) [19:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:11] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:18] ^assuming that's robh takin' care of bidness [19:19:10] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286416 (10Dzahn) [19:21:49] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286440 (10Dzahn) - created new project called "ocg" - added dzahn, cscott, giuseppe as admins - created instance called "ocg-jessie-01" - created "puppet group" called "ocg", added role class "role::ocg"... [19:23:40] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:23:41] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:23:42] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:23:46] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286444 (10Dzahn) currently we are getting: ``` Info: Retrieving pluginfacts Info: Retrieving plugin Info: Loading facts Error: Could not retrieve catalog from remote server:...
[19:23:51] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:01] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:01] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:11] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:21] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:21] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:22] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:31] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:31] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:31] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:24:32] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [19:25:13] !log stopped icinga-wm, flood due to ulsfo maintenance [19:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:27] (but it's gonna come back by itself) [19:27:58] bleh [19:28:06] so i put the cp in ulsfo in maint [19:28:12] but didnt foresee that error, cannot help it [19:28:55] yea, it's just because of the IPSEC monitoring, so it's on non-4xxx hosts [19:29:04] if icinga is off someone (not me im onsite doing the work) should keep an eye for other alerts [19:29:06] since we wont see it! [19:29:07] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2286452 (10chasemp) >>! In T133992#2282661, @hashar wrote: > We can revisit the list of contint-admins. I am not sure whether b... [19:29:38] 06Operations, 10Pybal, 10Traffic: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2286453 (10chasemp) p:05Triage>03High [19:29:52] i will do it with acknowledgements for those IPSEC checks instead.. then bring the bot back [19:30:08] 06Operations: check status of multiple systemd units - https://phabricator.wikimedia.org/T134890#2286454 (10chasemp) p:05Triage>03Normal [19:30:19] 06Operations, 10Traffic, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2286456 (10chasemp) p:05Triage>03Normal [19:30:30] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10Jonas) I can confirm I am still having read errors every few requests. ``` curl -v 'https://query.wikidata.org/ven...
[19:30:31] so i have replaced a 3rd of them [19:30:35] now powered off a third [19:30:38] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2286459 (10chasemp) p:05Triage>03Normal [19:30:40] and will do the last third when i finish this [19:30:43] so more will happen [19:33:09] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286465 (10chasemp) [19:33:34] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2286468 (10Krenair) We already have wikitech:Add_a_wiki, but people like to add additional steps without documenting them there. [19:34:02] grr [19:34:25] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to preview or save pages - https://phabricator.wikimedia.org/T134869#2286483 (10chasemp) p:05Triage>03High [19:34:50] so the actual fix would be to take these out of those configs [19:34:58] but i guess we didnt do that last time either [19:34:59] 06Operations, 10Mail, 10MediaWiki-Email: Wiki-Mail sent but never delivered - https://phabricator.wikimedia.org/T134674#2286487 (10chasemp) p:05Triage>03Normal [19:35:00] bblack: do you recall? [19:35:16] the 'grr' was at the duplicate icinga-wm , not you :) [19:35:30] killed one [19:36:09] some of these dells have half the thermal paste they should have [19:36:55] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [19:37:13] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [19:37:13] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [19:37:13] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [19:37:23] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [19:37:24] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [19:37:24] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [19:37:33] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [19:37:34] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [19:37:34] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [19:37:34] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [19:37:55] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [19:38:03] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [19:38:03] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [19:38:13] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [19:38:15] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [19:38:41] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [19:39:12] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [19:39:30] !log turned off notifications for icinga services that match IPSEC [19:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:03] robh: ^ no more messages if the service name has "IPsec" but the others are normal.
we just need to turn back on when you are done [19:41:12] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6135894 keys - replication_delay is 0 [19:43:01] mutante: just did a select checkbox disable for ipsec in the web interface? [19:43:54] robh: i typed "ipsec" into the search box and on the result page of that i hit the checkbox at the top of the table, yes [19:44:08] then "disable notifications" [19:44:19] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286518 (10Krenair) Am guessing this needs ops to run maintain-replicas.pl [19:54:59] 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286540 (10matmarex) [19:56:32] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 4 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2286556 (10Multichill) >>! In T134017#2286036, @StevenJ81 wrote: > It looks like things are coming along. Somebody also needs to deal with langu... [19:59:32] jouncebot: next [19:59:32] In 0 hour(s) and 0 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T2000) [19:59:49] (03PS2) 10Volans: MariaDB: tune thread-pool to avoid Aborted_connects [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T2000). [20:00:45] no further mobileapps deploy today [20:01:01] no deploys today. [20:01:05] 06Operations, 06Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286576 (10hashar) [20:01:36] (03PS1) 10Jdrewniak: T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288258 (https://phabricator.wikimedia.org/T134512) [20:03:01] 06Operations, 06Release-Engineering-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286540 (10hashar) #traffic people would sure... [20:04:20] ok, last 6 [20:04:25] almost done =P [20:09:36] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286590 (10Dzahn) 12:58 I don't think hieradata/common/ocg.yaml is taking effect 12:59 hieradata/common/ocg.yaml:temp_dir: "/mnt/tmpfs" 12:59 hieradata/labs/deployment-prep/c...
[20:09:56] robh: sorry about the ipsec spam :) [20:10:51] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286591 (10Dzahn) 13:15 my terminal is now filled with puppet-green 13:15 oh !:) [20:11:40] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:12:00] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:12:08] PROBLEM - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:12:15] blehhhh [20:12:17] thats my fault [20:12:19] robh: is this oy? [20:12:22] i took too many down in a row [20:12:23] ok :) [20:12:24] without recovery [20:12:26] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection refused [20:12:29] ulsfo is depooled [20:12:38] bblack: thats whats happening right? [20:12:45] robh: we are getting paged on ulsfo LVS outages? this is all expected? [20:12:50] ah [20:12:58] I should have scrolled down first [20:13:00] i emailed about this earlier in the week as well [20:13:06] but the pages were unintended [20:13:20] ulsfo isnt serving traffic. [20:13:20] yup just making sure [20:13:25] 06Operations, 06Project-Admins, 10Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2286611 (10Danny_B) I know... The deal is, that whatever task intended for SPDY will most probably be valid even for HTTP/2. [20:13:33] ok [20:13:34] ok [20:13:38] yeah last time i didnt generate those but i only took down a 3rd of them [20:13:50] i didnt think it was going to trigger it this time either, was my bad [20:14:15] robh: probably you took down all the uploads at once :) [20:14:33] i just finished the other cab and moved right to this one [20:14:38] so yep, the others likely havent recovered [20:14:42] or more than the threshold? :) [20:16:22] 5 more to go, it should recover shortly. [20:17:28] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [20:19:55] (03PS1) 10BBlack: Revert "Partial Revert: cache_misc: remove all CL-sensitive stream/pass logic" [puppet] - 10https://gerrit.wikimedia.org/r/288265 [20:20:06] (03CR) 10BBlack: [C: 032 V: 032] Revert "Partial Revert: cache_misc: remove all CL-sensitive stream/pass logic" [puppet] - 10https://gerrit.wikimedia.org/r/288265 (owner: 10BBlack) [20:23:51] !log wiping cache misc caches, again....
[20:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:52] 06Operations, 06Release-Engineering-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Inconsistently unable to download https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz (returns zero-byte response) - https://phabricator.wikimedia.org/T135038#2286678 (10hashar) [20:24:53] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2286680 (10hashar) [20:25:30] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2285020 (10hashar) From T135038: `curl -I https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz` Example res... [20:27:50] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2286694 (10Dzahn) So then same hostname but different port. [20:30:32] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286723 (10Dzahn) a:05Dzahn>03None Error: Could not start Service[apparmor]: Execution of '/etc/init.d/apparmor start' returned 1: Starting apparmor (via systemctl): apparmor.serviceJob for apparmor.s... [20:32:05] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286728 (10Krenair) >>! In T135034#2286723, @Dzahn wrote: > redis.exceptions.ConnectionError: Error 101 connecting to tin.eqiad.wmnet:6379. Network is unreachable. > > <.. wants to talk to production tin... [20:32:42] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286729 (10Krenair) Or you could set up a full deployment server using the existing manifest but I don't have time to figure that one out [20:34:09] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286743 (10Dzahn) I also have no experience with trebuchet and setting up a full deployment server in labs. This is really making it harder to "just" test in labs. [20:36:24] hrmm [20:36:32] bblack: so all but cp4009 are powered up or powering up [20:36:38] so i took down 50% of the upload [20:36:58] 1001-1002 [20:37:03] sorry 4001-4002 [20:37:13] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286750 (10cscott) Well... we're going to need a labs instance eventually. I don't understand why deployment-pdf01 etc in labs work fine but the new jessie instances don't? [20:37:15] 4003 and 4004 were swapped on thermal paste last time [20:37:28] so i guess one wasnt fully online to push ulsfo lvs for upload as unavail [20:38:39] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286754 (10Krenair) deployment-pdf01 has existing deployment servers in its project (deployment-prep) it can use [20:40:27] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286765 (10Dzahn) Does deployment-pdf01 (i did not know this existed, know who set it up?) use the regular puppetmaster and "role ocg" just like production?
[20:41:04] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286767 (10Krenair) It uses deployment-puppetmaster, it's in the deployment-prep project... [20:42:12] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286768 (10Krenair) mwalker set it up in May 2014: https://horizon.wikimedia.org/project/instances/d46df8b9-6c41-409d-9853-b2b4dc876088/ [20:42:49] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286770 (10Dzahn) I wonder if that means it actually uses the "role::ocg" puppet class just like ocg1001-3 have it in prod site.pp or something different. [20:43:07] i'm pretty sure it is configured identically [20:44:18] well, 4001-1004 are [20:44:24] bleh, missend [20:44:57] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286771 (10Krenair) ```krenair@bastion-01:~$ ldapsearch -x dc:dn:=deployment-pdf01.deployment-prep.eqiad.wmflabs # extended LDIF # # LDAPv3 # base (default) with scope subtree # filt... [20:45:02] mutante: you can re-enable the monitoring you disabled [20:45:29] some are still critical, but will start coming back. [20:46:44] cscott: alright, so if that's the same i guess the task becomes "upgrade deployment-pdf" to jessie or so [20:46:47] robh: done [20:47:32] ahh, the systems i thought were upload caches were not [20:47:36] !log re-enabled icinga notifications for ipsec [20:47:39] and i did totally take out too many of them, bleh [20:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:48] is ulsfo depooled? [20:47:51] yep [20:47:53] ok [20:49:55] (03CR) 10Volans: "@JCrespo: compiler looks good to me:" [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [20:50:39] bblack: so to recover the uploads, they should do so upon boot and auto puppet run or does it require manual firing? [20:50:50] they are all powered on [20:51:55] i see them running varnish [20:52:18] well, at least 4020 [21:01:18] (03PS2) 10Ottomata: [WIP] Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [21:04:00] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Druid module [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [21:07:21] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286917 (10Dzahn) Then it seems that turns this ticket into "install deployment-pdf-02 with jessie" or "upgrade deployment-pdf" [21:07:49] 06Operations, 06Services: make ocg role work on labs instances - https://phabricator.wikimedia.org/T135034#2286919 (10Dzahn) or rename it to "ocg" while at it.. [21:09:11] 06Operations, 06Services: make ocg role work on labs instances (install deployment-pdf instance with jessie) - https://phabricator.wikimedia.org/T135034#2286931 (10Dzahn) [21:12:18] <_joe_> where is tcpircbot? 
[21:12:23] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 988 bytes in 0.315 second response time [21:12:29] !log oblivian@palladium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_upload [21:12:34] <_joe_> robh: ^^ tadah [21:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:23] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1003 bytes in 0.315 second response time [21:13:31] RECOVERY - LVS HTTP IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 595 bytes in 0.153 second response time [21:13:41] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 582 bytes in 0.150 second response time [21:15:20] !log robh@palladium conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_text [21:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:27] !log ulsfo maint complete, will push traffic back shortly. [21:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:56] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2287071 (10Dzahn) a:05Dzahn>03None [21:23:19] (03PS1) 10Jdrewniak: T134512 bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288315 (https://phabricator.wikimedia.org/T134512) [21:23:43] robh: back [21:23:54] so, are there still question-marks about ulsfo? [21:24:36] upload is probably not all you took out with depools, let me look for a bit [21:25:22] (03PS2) 10Jdrewniak: T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288315 (https://phabricator.wikimedia.org/T134512) [21:25:40] robh: nevermind! checked, everything is pooled [21:26:27] (03Abandoned) 10Jdrewniak: T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288258 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [21:39:42] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2287218 (10BBlack) Status update: we've been debugging this off and on all day. It's some kind of bug fallout from cache_misc... [21:42:19] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2287225 (10mmodell) @dzahn yes I believe so, port `22280` by default. See [[ https://secure.phabricator.com/book/phabricator/article/notifications/#terminating-ssl-...
(03PS3) 10Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [21:49:23] (03PS1) 10Mattflaschen: Add Flow-specific External Store on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) [21:50:16] (03CR) 10jenkins-bot: [V: 04-1] WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [21:50:38] (03CR) 10Mattflaschen: [C: 04-1] "This is ready for review, but I would like to be around when it is deployed (i.e. merged) just in case." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288323 (https://phabricator.wikimedia.org/T128417) (owner: 10Mattflaschen) [21:51:36] (03PS4) 10Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [22:00:29] bblack: yep, i fixed it with _joe_'s help! [22:00:40] now that im not in my car, im going to push the dns change to allow traffic back [22:00:54] (03PS1) 10RobH: Revert "disabling ulsfo for onsite work on 2016-05-11" [dns] - 10https://gerrit.wikimedia.org/r/288325 [22:01:20] I didnt realize i had to force them back into the pool [22:01:32] (03CR) 10RobH: [C: 032] Revert "disabling ulsfo for onsite work on 2016-05-11" [dns] - 10https://gerrit.wikimedia.org/r/288325 (owner: 10RobH) [22:02:46] (03PS5) 10Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) [22:09:01] robh: ok [22:09:29] robh: yeah it's like a dead-man switch. when cache hosts shut down, they auto-depool themselves before they stop their varnish/nginx daemons [22:09:41] there's a flag you can set to auto-repool post-reboot if you expect them to come back quick [22:28:53] (03CR) 10EBernhardson: [C: 031] Bump CirrusSearchRequestSet avro schema to rev 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287973 (https://phabricator.wikimedia.org/T133726) (owner: 10DCausse) [22:29:41] (03CR) 10EBernhardson: [C: 031] "dependent patch merged, looks good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281966 (https://phabricator.wikimedia.org/T131944) (owner: 10DCausse) [22:35:48] bblack: otherwise its best to let them come back online but not pool until their caches are hot? (just wondering why the manual step) [22:35:53] ulsfo seems to be repooling well [22:39:01] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 4 others: "Elastica: missing curl_init_pooled method" due to mwscript job running with PHP 5 on terbium - https://phabricator.wikimedia.org/T132751#2287530 (10Deskana) 05Open>03Resolved [22:44:09] robh: the auto-depool is meant for individual machines, not taking out a whole DC :) [22:44:19] they have to be repooled before we move traffic back in DNS [22:44:29] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 5 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2287567 (10Deskana) [22:47:30] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
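On the depool/repool mechanics discussed above: the "conftool action" entries in this log correspond to confctl runs on the conftool master (palladium here). A hedged sketch of the repool step; the selector is taken verbatim from the log, but the exact confctl subcommands may vary by conftool version:

```
# Repool all cache_upload backends in ulsfo, mirroring the logged
# "conftool action : set/pooled=yes; selector: dc=ulsfo,cluster=cache_upload"
confctl select 'dc=ulsfo,cluster=cache_upload' set/pooled=yes

# Inspect the resulting state before pushing traffic back via DNS:
confctl select 'dc=ulsfo,cluster=cache_upload' get
```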
[22:47:38] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:49:19] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [22:49:28] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [23:00:04] RoanKattouw: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160511T2300). Please do the needful. [23:00:04] RoanKattouw jan_drewniak: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] I'll do the SWAT today [23:02:27] debt: Allegedly you know what https://phabricator.wikimedia.org/T134512 is about [23:02:45] A patch for that task is listed for the SWAT today and I need someone to tell me if it's working OK after I deploy it [23:03:33] (03CR) 10Catrope: [C: 032] Enable Flow beta feature on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288224 (https://phabricator.wikimedia.org/T134898) (owner: 10Catrope) [23:03:41] (03PS2) 10Catrope: Enable Flow beta feature on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288224 (https://phabricator.wikimedia.org/T134898) [23:03:46] (03CR) 10Catrope: [C: 032] Enable Flow beta feature on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288224 (https://phabricator.wikimedia.org/T134898) (owner: 10Catrope) [23:03:50] RoanKattouw: I can verify if that's working :) [23:03:59] Oh you're here :) [23:04:05] yup - I'm here and can help [23:04:17] Since it's 1am where you are I figured you would likely be asleep [23:04:18] Cool [23:04:37] yeah.... jan_drewniak is a rockstar for getting up in the middle of the night! [23:04:37] (03Merged) 10jenkins-bot: Enable Flow beta feature on specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288224 (https://phabricator.wikimedia.org/T134898) (owner: 10Catrope) [23:04:43] RoanKattouw: as if I get any sleep with a baby anyway :P [23:04:59] Oh I did not realize you had a baby [23:05:02] Yes then I completely understand [23:05:40] (03PS3) 10Catrope: T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288315 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [23:05:50] (03CR) 10Catrope: [C: 032] T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288315 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [23:05:59] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on specieswiki (duration: 00m 34s) [23:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:39] (03Merged) 10jenkins-bot: T134512 updating survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288315 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [23:07:14] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (10Dzahn) it's a "labsdb" thing afaict.
one hit in entire wikitech " springle: ran operations/software maintain-replicas.pl and fedtables.pl on labsdbs" [23:08:00] !log catrope@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 31s) [23:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:11] jan_drewniak, debt: OK, should be live now [23:08:23] Or, wait, there's a part 2 that's still running [23:08:26] !log catrope@tin Synchronized portals: (no message) (duration: 00m 25s) [23:08:32] RoanKattouw: cool! [23:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:35] OK *now* it should be live [23:09:45] RoanKattouw: jan_drewniak it doesn't look like the stats were updated [23:10:44] debt: that issue hasn't been resolved yet. I just increased the occurrence of the banner in this deploy [23:10:55] gotcha [23:11:17] I had thought that was the second part that RoanKattouw was talking about :) [23:11:28] jan_drewniak: wishing...wishing... :) [23:13:01] in any case, RoanKattouw: It's working :) [23:13:22] !log catrope@tin Synchronized php-1.28.0-wmf.1/extensions/Echo/: SWAT (duration: 00m 33s) [23:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:59] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2287649 (10Dzahn) 05Open>03stalled [23:24:48] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2287666 (10Dzahn) for the record: created labs project "ircd" with instance "udpmx-01", added puppet group / role class before reboot, things all work now in... [23:26:04] !log catrope@tin Started scap: Updating wmf23 Echo to wmf1 [23:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:27] RoanKattouw, are you doing swat? [23:27:38] *still [23:31:47] yurik: yeah I'm running a scap right now [23:32:39] RoanKattouw, could you ping me & MaxSem - we need to push out the zero patch, but we will do it after you are done [23:32:45] *once done :) [23:32:51] OK [23:32:53] gah, i keep skipping important words :) [23:45:44] (03PS1) 10Dzahn: ircd: remove exec permission bit on unit file [puppet] - 10https://gerrit.wikimedia.org/r/288333 [23:47:01] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2287767 (10Dzahn) ..but the bot is not on the irc server. after restarting the service it is connected. checked with irssi /whois rc-pmtpa, so it's as reported.... [23:52:49] !log catrope@tin Finished scap: Updating wmf23 Echo to wmf1 (duration: 26m 45s) [23:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:33] yurik: scap done, go for it [23:55:00] MaxSem, ^ [23:55:03] go go go :) [23:58:31] pulling [23:58:44] done, it's on mw1017 now [23:58:54] how to test? [23:59:07] MaxSem, checking... [23:59:36] MaxSem, is http://test.wikipedia.org/ being served from 1017 [23:59:37] ? [23:59:51] you need to set X-Wikimedia-Debug
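The log ends mid-explanation: to check whether a request is served by the debug backend (mw1017 at the time), the X-Wikimedia-Debug request header forces routing to it. A hedged example; the exact header value convention has varied over time, and "backend=..." is one known form:

```
# Send a request pinned to the debug backend and show the response headers:
curl -sI -H 'X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet' \
  'https://test.wikipedia.org/wiki/Main_Page' | head -n 20
```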