[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T0000). [00:00:16] twentyafterfour: I'm done for SWAT. [00:01:48] (03PS2) 10Dzahn: Fix typos on Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [00:02:42] (03CR) 10Chad: "Original version was written in a rush some time ago ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [00:03:27] (03CR) 10Dzahn: "Actually.. that's not a typo :) freenode themselves uses "freenode". http://freenode.net/" [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [00:03:50] (03CR) 10Dzahn: ""freenode has been providing services to Free and Open Source Software projects"" [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [00:04:26] (03CR) 10Dzahn: "but let's remove the literal tabs in this file" [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [00:06:34] (03CR) 10Dzahn: "the other way around? 
would remove all the tabs and consistent spaces" [puppet] - 10https://gerrit.wikimedia.org/r/356478 (owner: 10Bartosz Dziewoński) [00:19:28] (03CR) 10Dzahn: Gerrit: Add non-masters to have public DNS entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad) [00:37:21] (03CR) 10Dzahn: [C: 031] l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [01:04:22] (03PS1) 10Dzahn: gerrit: switch to base::service_unit, import sysvinit script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/356516 [01:08:25] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6602/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [01:16:13] (03CR) 10Dzahn: [C: 04-1] "I think we should:" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [01:17:30] (03CR) 10Dzahn: [C: 04-1] "P.S. note that base::service_unit also has a "systemd_override" parameter to tell it to use a unit file provided by a Debian package. That" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [01:27:01] (03PS1) 10Dzahn: gerrit: import systemd unit file from deb to puppet [puppet] - 10https://gerrit.wikimedia.org/r/356517 [01:28:46] (03CR) 10Dzahn: "after this would be https://gerrit.wikimedia.org/r/#/c/356517/" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [01:29:58] (03CR) 10Dzahn: [C: 031] jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:31:10] (03CR) 10Dzahn: [C: 031] "i like the content now, as i suggested. just need to adjust commit message to reflect the content." 
[puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:31:46] (03CR) 10Paladox: "> I think we should:" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [01:32:12] (03PS11) 10Dzahn: jenkins: Install java 8 on stretch and greater [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:32:39] (03CR) 10Paladox: [C: 031] "Let's merge this so we can add the systemd one please :)" [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [01:34:07] (03CR) 10Dzahn: [C: 04-1] "also: https://gerrit.wikimedia.org/r/#/c/356517/" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [01:35:39] (03PS6) 10Dzahn: contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:37:08] (03Draft1) 10Paladox: Gerrit: Add systemd service to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/356518 [01:37:11] (03PS2) 10Paladox: Gerrit: Add systemd service to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/356518 (https://phabricator.wikimedia.org/T158946) [01:37:26] (03CR) 10Dzahn: [C: 032] "per hashar, already has a guard in the parent role that uses this" [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [01:39:59] (03CR) 10Paladox: [C: 031] gerrit: import systemd unit file from deb to puppet [puppet] - 10https://gerrit.wikimedia.org/r/356517 (owner: 10Dzahn) [01:40:15] (03Abandoned) 10Paladox: Gerrit: Add systemd service to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/356518 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [01:41:55] (03CR) 10Dzahn: "that's a duplicate of 
https://gerrit.wikimedia.org/r/#/c/356517/ except it adds the template AND switches to systemd at the same time, so" [puppet] - 10https://gerrit.wikimedia.org/r/356518 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [02:25:03] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 29s) [02:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:43] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 02s) [02:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:52:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 1 02:52:25 UTC 2017 (duration 6m 42s) [02:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:47] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100% [04:31:07] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:08] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:27] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:31:27] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:31:27] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:27] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:37] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:31:37] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:31:37] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:37] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:47] PROBLEM - IPsec on 
cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:31:57] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:32:07] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:32:07] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:32:07] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6 [04:32:07] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6 [04:42:27] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 1 down 0 [05:06:07] (03CR) 10Tim Starling: "> Also see https://github.com/facebook/hhvm/blob/493bdc2e8dcd7c16e2a57f63983bcfdf8eeac9c4/hphp/runtime/ext/string/ext_string.cpp#L2449 - I" [puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: 10Tim Starling) [05:06:22] (03PS2) 10Tim Starling: For HHVM set LANG=C.UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) [05:09:07] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [05:10:07] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.108 second response time [05:22:37] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [05:22:54] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3306721 (10Marostegui) The battery looks good now, it recharged, the temperature is ok and I have disabled the auto learn. I have started MySQL and once it ca... 
[05:24:04] (03PS1) 10Marostegui: Revert "wmnet: Point m3 slave to eqiad master" [dns] - 10https://gerrit.wikimedia.org/r/356525 [05:24:25] (03CR) 10Marostegui: [C: 04-2] "Wait for db1048 to catch up" [dns] - 10https://gerrit.wikimedia.org/r/356525 (owner: 10Marostegui) [05:34:54] cp3032 died ? [05:37:23] can reach the console though [05:38:46] a lot of errors like bnx2x: [bnx2x_mc_assert:750(eth0)]Chip Revision: everest3 [05:42:43] all right will explicitly depool it (just in case it recovers or get back to traffic without manual intervention) [05:43:33] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3032.esams.wmnet [05:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:34] seems something like T148891 [05:47:34] T148891: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891 [05:51:10] opening a phab task [05:56:44] 06Operations, 10Traffic: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758#3306726 (10elukey) [05:58:19] 06Operations, 10Traffic: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758#3306738 (10elukey) Host depooled manually, tried to run: ``` root@cp3032:/home/elukey# ifconfig eth0 down [3097418.717749] bnx2x: [bnx2x_del_all_macs:8501(eth0)]Failed to delete MACs: -5 [3... 
[05:58:32] !log powercycle cp3032 - T166758 [05:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:42] T166758: cp3032 ethernet link down (bnx2x dump in the dmesg) - https://phabricator.wikimedia.org/T166758 [05:59:57] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [06:00:07] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [06:00:07] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [06:00:07] RECOVERY - Host cp3032 is UP: PING OK - Packet loss = 0%, RTA = 119.54 ms [06:00:08] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [06:00:08] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [06:00:08] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [06:00:08] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [06:00:27] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [06:00:27] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [06:00:27] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [06:00:28] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [06:00:37] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [06:00:37] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [06:00:37] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [06:00:37] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [06:00:47] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [06:02:14] (03PS1) 10Marostegui: db-eqiad.php: Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356532 (https://phabricator.wikimedia.org/T166278) [06:02:48] !log Deploy alter table on s3, dbstore1001 - T166278 [06:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:56] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:04:42] !log Deploy alter table on s3, db1044 - T166278 [06:04:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356532 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:06:28] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356532 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:06:37] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356532 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [06:07:28] brb [06:07:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1035 - T166278 (duration: 00m 57s) [06:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:03] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [06:08:13] !log Deploy alter table on s3, labsdb1010 - T166278 [06:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:53] (03CR) 10Marostegui: [C: 032] Revert "wmnet: Point m3 slave to eqiad master" [dns] - 10https://gerrit.wikimedia.org/r/356525 (owner: 10Marostegui) [06:28:11] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306767 (10Joe) >>! In T166345#3305630, @Krinkle wrote: > (Continued investigation using t... [06:30:08] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (10Marostegui) That is very strange, the last accesses by the top IP we saw yesterday end up at 19:45 for enwiki. The last access that IP has was at "201... 
[06:32:00] <_joe_> 28/win 19 [06:32:03] <_joe_> wtf [06:39:44] (03PS3) 10Muehlenhoff: Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 [06:57:05] (03CR) 10Muehlenhoff: [C: 032] Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 (owner: 10Muehlenhoff) [06:58:45] (03PS3) 10Phedenskog: Add Save Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356449 (https://phabricator.wikimedia.org/T153170) [07:01:47] (03PS1) 10Muehlenhoff: Record extended account dates for two researchers [puppet] - 10https://gerrit.wikimedia.org/r/356535 [07:07:42] (03CR) 10Muehlenhoff: [C: 032] Record extended account dates for two researchers [puppet] - 10https://gerrit.wikimedia.org/r/356535 (owner: 10Muehlenhoff) [07:21:44] (03CR) 10Muehlenhoff: "I don't really see the point for using base::service_unit here:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [07:24:21] (03PS6) 10Giuseppe Lavagetto: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 (owner: 10Alexandros Kosiaris) [07:24:23] (03PS1) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [07:27:44] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306775 (10Gilles) Based on last night's short redeploy, it's extremely likely that the is... 
[07:29:59] (03PS2) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [07:33:28] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [07:36:26] (03PS3) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [07:36:38] <_joe_> LOL of course our CI thinks this is wrong [07:38:01] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [07:39:47] (03PS4) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [07:41:09] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [07:46:03] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306796 (10Gilles) I guess this API call might do? ``` https://en.wikipedia.org/w/api.php... [07:49:39] !log editing wikiversions.php manually on mwdebug1001 to point enwiki to wmf.2 [07:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:42] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306828 (10Gilles) The API seems to fetch something cached, won't do. I tried running the... 
[08:10:17] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [08:11:17] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.117 second response time [08:15:03] !log upgrade grafana to 4.3.2 on labmon1001 / krypton [08:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:19] (03PS9) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [08:18:08] tried to make it a little bit better -^ [08:18:15] still not sure if it is enough or not [08:18:35] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306847 (10Gilles) With mwdebug1001 pointed to wmf.2, ?action=purge on the main page worke... [08:21:36] (03PS1) 10Muehlenhoff: Tighten access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) [08:39:44] brion, do you know why there are still small transcodes failing? https://commons.wikimedia.org/wiki/File%3ABhagam_Bhag_%281956%29.webm [08:41:12] yannf: they shouldn't afaik, at least not the same issue from earlier this week [08:42:26] godog, and these are still not fixed https://commons.wikimedia.org/wiki/User:Yann/Youtube_TODO#Failed_transcodes [08:42:32] (old issue) [08:45:27] yannf: yeah if the resulting transcode is >4GB it won't work, which might be the case for some of those [08:45:31] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306883 (10Gilles) Last night after the purge the main page was showing up several times i... 
[08:47:34] godog, these files are big, but not over 4 GB, how the transcodes could be over 4 GB? [08:51:38] yannf: one case might be different transcoding options used on the original push the transcode over 4GB, though you are right some of those originals are nowhere close to 4GB [08:51:58] or maybe the transcode itself failed and/or hit some other limits, e.g. in time [08:53:22] godog, I have opened several reports about this https://phabricator.wikimedia.org/T157028 [08:53:53] yannf: nice, thanks! [08:54:12] I have to go afk now for a bit, bbiab [08:54:41] godog: we fixed a while ago an issue with the mod_proxy_fcgi timeout on the videoscalers, setting it to 1 day.. I don't think it gets breached but we can check [08:54:51] 06Operations, 10OTRS, 13Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3306888 (10akosiaris) [08:55:52] (03PS5) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [08:56:56] should I restart the failed transcodes? [08:57:07] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [09:03:33] yannf: thanks for all the info, but I'd have a question about the tracking task.. are those errors of ffmpeg in your opinion or others? 
[09:04:18] (03PS6) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [09:05:17] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [09:06:33] elukey, there are several reasons, I opened one report for each EXITCODE [09:09:22] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306919 (10Gilles) @krinkle browsing through xhgui I don't see your hit on action=purge.... [09:13:31] yannf: sure sure, but they seem all related to ffmpeg/transcoding specific issues rather than problems with the videoscalers in general right? Just trying to figure out if you guys are waiting for ops support or more dev one (from experts like b*rion for example) [09:13:55] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6611/mw1261.eqiad.wmnet/ it compiles with some slight real differences I still have to weed out, plus " [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [09:15:55] elukey, there were issues because of overloaded servers, but this seems to be fixed now [09:16:31] and yes, brion worked on this [09:16:38] *has [09:17:04] yannf: I opened a task a while ago after the last outage but it didn't re-happen - https://phabricator.wikimedia.org/T162815 [09:17:18] so I was wondering if we had newer issues [09:17:22] sounds like we are "ok" [09:18:12] I am also looking forward to putting Debian Jessie in there with better hw [09:21:53] 06Operations, 10Community-Wikimetrics, 10DBA, 10Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3306961 (10jcrespo) Some modules maybe should install wmf-mariadb101-client ? 
[09:23:54] (03PS2) 10Ema: Instrumentation fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/354680 (https://phabricator.wikimedia.org/T103882) [09:24:52] (03CR) 10Jcrespo: [C: 032] query-killer: Do not kill queries containing gtid_wait or DMLs [software] - 10https://gerrit.wikimedia.org/r/351796 (owner: 10Jcrespo) [09:26:03] (03PS1) 10Elukey: Revert "Add zookeeper.yaml to hieradata common" [labs/private] - 10https://gerrit.wikimedia.org/r/356556 [09:26:11] (03CR) 10Elukey: [V: 032 C: 032] Revert "Add zookeeper.yaml to hieradata common" [labs/private] - 10https://gerrit.wikimedia.org/r/356556 (owner: 10Elukey) [09:39:54] (03PS10) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [09:40:24] (03CR) 10Jcrespo: "root@db1089:~$ lsof -p $(pidof mysqld) | wc -l # s1-production" [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [09:42:33] (03CR) 10Paladox: [C: 031] "Might as well make systemd true in jetty since the systemd script would have been installed already so sudo service will use the systemd s" [puppet] - 10https://gerrit.wikimedia.org/r/356517 (owner: 10Dzahn) [09:50:08] (03PS2) 10Jcrespo: dbtools: Update package for stretch and include systemd support [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) [09:51:37] <_joe_> !log refreshing facts on the puppet compiler [09:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:56] (03PS1) 10Hashar: jenkins: get rid of /etc/init.d/jenkins [puppet] - 10https://gerrit.wikimedia.org/r/356560 [09:52:20] (03CR) 10jerkins-bot: [V: 04-1] dbtools: Update package for stretch and include systemd support [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [09:53:34] (03CR) 10Hashar: "I think you had a similar issue with Gerrit yesterday: namely trying to use 
/etc/init.d/gerrit instead of systemctl restart gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [09:54:39] (03CR) 10Paladox: "> I think you had a similar issue with Gerrit yesterday: namely" [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [09:54:55] are there problems with CI? [09:55:44] (03PS2) 10Filippo Giunchedi: base: blacklist acpi_power_meter [puppet] - 10https://gerrit.wikimedia.org/r/356422 (https://phabricator.wikimedia.org/T125205) [09:56:06] I got this, but not sure I follow the error: https://integration.wikimedia.org/ci/job/tox-jessie/18506/console [09:56:40] ImportError: No module named six [10:00:17] (03CR) 10Filippo Giunchedi: [C: 032] base: blacklist acpi_power_meter [puppet] - 10https://gerrit.wikimedia.org/r/356422 (https://phabricator.wikimedia.org/T125205) (owner: 10Filippo Giunchedi) [10:03:44] (03PS11) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [10:06:38] !log run puppet to blacklist acpi_power_meter across the fleet and rmmod the module [10:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] (03PS12) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [10:08:06] (03CR) 10Muehlenhoff: jenkins: get rid of /etc/init.d/jenkins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [10:11:07] (03PS2) 10Ema: varnish: do not chmod VSM files [puppet] - 10https://gerrit.wikimedia.org/r/356401 [10:11:15] (03CR) 10Ema: [V: 032 C: 032] varnish: do not chmod VSM files [puppet] - 10https://gerrit.wikimedia.org/r/356401 (owner: 10Ema) [10:14:48] (03CR) 10Volans: [C: 04-1] "Global structure looks ok, see a bunch of comments inline." 
(0318 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [10:15:47] ahahahha [10:16:03] 22 + 18 = 40 comments in total (from another code review) [10:16:07] good job Luca [10:16:15] rotfl [10:16:30] you should distinguish the [optional] ones and the questions though [10:16:36] :-P [10:16:55] and damn.. 42 was the goal number [10:16:58] I missed it [10:21:26] (03CR) 10Hashar: jenkins: get rid of /etc/init.d/jenkins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [10:24:11] !log Point nutcracker to localhost on mwdebug1001 [10:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] (03CR) 10Muehlenhoff: jenkins: get rid of /etc/init.d/jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [10:31:42] (03CR) 10Paladox: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [10:38:28] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307070 (10Gilles) I tried pointing nutcracker to localhost again on mwdebug1001. Purged t... 
[10:42:44] (03CR) 10Ema: [V: 032 C: 032] Instrumentation fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/354680 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [10:48:58] (03PS13) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [10:49:53] (03CR) 10jerkins-bot: [V: 04-1] role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (owner: 10Elukey) [10:54:31] (03PS14) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [10:55:43] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307084 (10Gilles) OK, I think it's still the issue of the parser cache being backed by th... [10:59:22] (03PS1) 10Ema: Re-enable temperature monitoring via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/356567 (https://phabricator.wikimedia.org/T125205) [11:03:15] ema: great :) [11:03:24] but are we collecting temp data from both hwmon and IPMI? [11:04:54] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307117 (10Gilles) That did what I expected, now I get a parser run from mwdebug1001 for t... [11:05:56] (03CR) 10Faidon Liambotis: "Are we doing temperature monitoring using both IPMI and Prometheus/hwmon?" 
[puppet] - 10https://gerrit.wikimedia.org/r/356567 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [11:06:57] ehehe I had the same question the other day, if we could use hwmon for both without bothering the ipmi [11:11:09] (03PS15) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [11:13:17] almost, pcc is finally looking decent --^ [11:17:37] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:18:27] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73845 bytes in 1.796 second response time [11:19:07] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307130 (10Gilles) Looking at logstash I see this suspicious message for the loggedout cas... [11:21:14] gilles: well done :-) [11:21:37] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:27] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73845 bytes in 4.518 second response time [11:23:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1161.70 Read Requests/Sec=3839.80 Write Requests/Sec=7.30 KBytes Read/Sec=25160.80 KBytes_Written/Sec=3492.00 [11:29:47] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293314 (10jcrespo) > why is mwdebug1001 picking up parser cache written by codfw Note on... 
[11:31:53] <_joe_> gilles: so it was the parsercache, not wancache [11:31:54] <_joe_> :P [11:33:52] !log test upgrade of swift 2.10 on ms-fe2005 - T162609 [11:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:02] T162609: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609 [11:34:17] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=6.90 Read Requests/Sec=0.00 Write Requests/Sec=2.70 KBytes Read/Sec=0.00 KBytes_Written/Sec=56.80 [11:34:51] (03PS1) 10Volans: Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) [11:36:26] (03PS2) 10MaxSem: Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:37:14] (03CR) 10MaxSem: [C: 031] Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:40:58] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Maps cache: fix parameters stripped away (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:43:57] (03PS3) 10Volans: Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) [11:44:18] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307196 (10Gilles) I'd like to figure out a few things about codfw here that I spotted dur... [11:44:23] (03CR) 10Volans: "addressed comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:44:48] (03CR) 10Giuseppe Lavagetto: [C: 031] Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:45:36] (03CR) 10Ema: [V: 032 C: 032] Maps cache: fix parameters stripped away [puppet] - 10https://gerrit.wikimedia.org/r/356570 (https://phabricator.wikimedia.org/T164608) (owner: 10Volans) [11:46:54] 06Operations, 10Monitoring, 03Interactive-Sprint, 06Maps (Kartotherian), 07Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3307199 (10MaxSem) [11:47:20] ACKNOWLEDGEMENT - HP RAID on ms-be1020 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166777 [11:47:23] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166777#3307213 (10ops-monitoring-bot) [11:47:33] 06Operations, 10Monitoring, 03Interactive-Sprint, 06Maps (Kartotherian), 07Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3307217 (10MaxSem) p:05Triage>03High [11:52:52] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166777#3307225 (10Volans) @Cmjohnson @Papaul FYI: given that now the RAID alarm in Icinga can be triggered also for a faulty BBU or wrong WritePolicy, I've added on top of the get raid output the Icing... 
[11:56:53] (03CR) 10Hashar: jenkins: get rid of /etc/init.d/jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [11:57:13] (03Abandoned) 10Hashar: jenkins: get rid of /etc/init.d/jenkins [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [11:58:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:00:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:01:54] (03Abandoned) 10Faidon Liambotis: Remove trebuchet setup from restbase config [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [12:04:38] (03CR) 10Faidon Liambotis: [C: 04-2] "1) Put it in the package, no reason (and way more confusing) to have it in puppet." [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [12:05:49] (03PS4) 10Faidon Liambotis: Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 [12:06:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:06:25] (03CR) 10Faidon Liambotis: [C: 032] Rewrite the LLDP fact(s) [puppet] - 10https://gerrit.wikimedia.org/r/354084 (owner: 10Faidon Liambotis) [12:06:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:07:06] (03CR) 10Hashar: contint: Only install libmysqlclient-dev if on trusty or jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356246 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:09:24] merging the lldp fact, it could go wrong :) [12:09:29] (03CR) 10Paladox: contint: Only install libmysqlclient-dev if on trusty or jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356246 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [12:09:43] lol [12:11:26] (03CR) 10Hashar: [C: 031] Do the echo when running 
update.php [puppet] - 10https://gerrit.wikimedia.org/r/354932 (owner: 10Reedy) [12:23:47] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3307252 (10chasemp) @cmjohnson @robh the `row b` requirement for labvirts and labnets is unfortunately still real as of now. We are working on it t... [12:30:34] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3307281 (10chasemp) >>! In T166237#3304793, @Papaul wrote: > @chasemp the others lab servers in the DHCP file are pointing to the Trusty install do you want t... [12:31:24] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307283 (10Gilles) New plot twist... TMH was wrongly accused. I did more digging in mwrepl... [12:32:28] _joe_: yeah, similar idea [12:36:40] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293316 (10TheDJ) @Gilles Krinkle and I poured over that TMH change during the hackathon.... [12:37:46] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307307 (10Gilles) @TheDJ I'm now pretty sure that the change to TMH is fine and it's this... [12:38:18] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 06Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3307311 (10Gehel) a:05Gehel>03Cmjohnson wdqs1002 has not had any issue since then. 
Hardware request is done on a separate ti... [12:40:56] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3307314 (10chasemp) >>! In T166237#3301373, @RobH wrote: > It seems that there is confusion, due to the fact that racktables shows two labtestvirt2001 systems... [12:42:01] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Replace wdqs100[12] servers as they are getting old - https://phabricator.wikimedia.org/T166780#3307316 (10Gehel) [12:43:05] (03PS16) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [12:50:51] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307338 (10thcipriani) @Gilles the mw-parser-output change in CommonSettings was created a... [12:53:29] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307339 (10Gilles) Is it expected that the parser output in wmf.2 has the mw-parser-output... [12:55:26] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307344 (10Gilles) Removing the config hook would probably make the slowdown issue go away... [12:59:07] (03PS17) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1300). [13:04:19] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307355 (10thcipriani) >>! In T166345#3307344, @Gilles wrote: > Removing the config hook w... [13:05:52] jouncebot: refresh [13:05:54] I refreshed my knowledge about deployments. [13:05:55] jouncebot: next [13:05:55] In 2 hour(s) and 54 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1600) [13:06:48] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307356 (10Gilles) mw-parser-output making it into the loggedout parser cache is reproduci... [13:07:34] (03PS1) 10Ema: varnish: add explicit guards around upload-specific VCL [puppet] - 10https://gerrit.wikimedia.org/r/356583 (https://phabricator.wikimedia.org/T164608) [13:09:19] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:09] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 73843 bytes in 0.122 second response time [13:11:41] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307366 (10jcrespo) > is Mediawiki itself on codfw supposed to be configured as read-only?... 
[13:11:42] !log restored original configuration on mwdebug1001 [13:11:48] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307367 (10thcipriani) >>! In T166345#3307356, @Gilles wrote: > mw-parser-output making it... [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:20] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307368 (10jcrespo) BTW, I think many of those monitoring checks specifically parse the Ma... [13:15:14] !log Deploy alter table s3 revision on labsdb1011 - T166278 [13:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:22] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [13:15:49] (03CR) 10Ema: "> Are we doing temperature monitoring using both IPMI and" [puppet] - 10https://gerrit.wikimedia.org/r/356567 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [13:17:32] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307373 (10Gilles) That makes sense, and it explains the slowdown because folks forgot abo... [13:18:39] !log Deploy alter table s3 revision on labsdb1001 - T166278 [13:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307375 (10Gilles) >>! 
In T166345#3307368, @jcrespo wrote: > BTW, I think many of those mo... [13:20:28] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307376 (10jcrespo) What about the passive->active DC replication, should we do something... [13:21:50] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307377 (10Gilles) I've just double checked and group0/group1 wikis definitely look affect... [13:24:48] (03PS18) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [13:25:46] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307384 (10Gilles) >>! In T166345#3307376, @jcrespo wrote: > What about the passive->activ... [13:27:19] (03PS1) 10Jcrespo: Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) [13:28:12] jynus: please change db-eqiad too [13:28:17] and thanks for taking care of it [13:28:47] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, and 2 others: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307388 (10jcrespo) ^I have at least changed the message that, so it doesn't refer to the switchover... 
[13:30:45] (03PS2) 10Jcrespo: codfw:dc Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) [13:31:28] (03CR) 10Jcrespo: "Volans: Does this messes up the failover script in anyway?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) (owner: 10Jcrespo) [13:32:27] _joe_: revert-happy is the Brion rule re https://phabricator.wikimedia.org/T166345#3293926 [13:33:48] (03PS19) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [13:42:21] jynus: I'll answer after lunch, but the TL;DR don't worry about it for now ;) [13:42:55] (03PS3) 10Jcrespo: codfw:dc Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) [13:44:27] (03PS4) 10Jcrespo: db-readonly:Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) [13:47:40] (03PS5) 10Jcrespo: db-readonly: Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) [13:48:36] (03CR) 10Muehlenhoff: Gerrit: Set ulimit's in gerrit.service (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [13:50:16] (03PS2) 10Ema: varnish: add explicit guards around upload-specific VCL [puppet] - 10https://gerrit.wikimedia.org/r/356583 (https://phabricator.wikimedia.org/T164608) [13:52:40] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, and 2 others: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3307485 (10Anomie) >>! In T166345#3307355, @thcipriani wrote: >>>! 
In T166345#3307344, @Gilles wrote:... [13:52:45] (03PS8) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) [13:52:52] (03PS2) 10Filippo Giunchedi: site: add prometheus global instance in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356427 [13:52:59] (03CR) 10Paladox: Gerrit: Set ulimit's in gerrit.service (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [13:55:13] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2048 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356587 [13:55:17] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2048 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356587 [13:55:31] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db2048 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356587 (owner: 10Jcrespo) [13:56:09] (03CR) 10Filippo Giunchedi: [C: 032] site: add prometheus global instance in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356427 (owner: 10Filippo Giunchedi) [13:58:26] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2048 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356587 (owner: 10Jcrespo) [13:58:30] (03PS1) 10Thcipriani: Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356589 (https://phabricator.wikimedia.org/T166345) [13:58:38] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2048 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356587 (owner: 10Jcrespo) [14:00:12] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2048 after maintenance (duration: 00m 44s) [14:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:35] (03CR) 10Anomie: [C: 031] Revert "Add RejectParserCacheValue handler for mw-parser-output 
invalidation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356589 (https://phabricator.wikimedia.org/T166345) (owner: 10Thcipriani) [14:01:33] (03PS1) 10Andrew Bogott: Labs: Include 'bikeshed' package on Trusty VMs. [puppet] - 10https://gerrit.wikimedia.org/r/356590 [14:06:36] (03CR) 10Ottomata: [C: 031] role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 (owner: 10Elukey) [14:08:21] (03CR) 10Andrew Bogott: [C: 032] Labs: Include 'bikeshed' package on Trusty VMs. [puppet] - 10https://gerrit.wikimedia.org/r/356590 (owner: 10Andrew Bogott) [14:08:26] <_joe_> elukey: can I look again? [14:09:50] _joe_ still haven't pushed the zookeeper_cluster_name to roles, so not finished :( [14:09:57] <_joe_> oook [14:09:59] (03CR) 10Ottomata: Tighten access to zookeeper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [14:10:23] <_joe_> it's not in too many places come on :P [14:10:36] (03PS1) 10Filippo Giunchedi: hieradata: default retention for prometheus/ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356591 [14:10:38] (03PS1) 10Filippo Giunchedi: role: report instance disk full % in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/356592 [14:10:51] _joe_ there is also kafka_config.rb [14:10:56] I want to be careful [14:11:06] <_joe_> sigh [14:11:33] (03CR) 10Muehlenhoff: Tighten access to zookeeper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [14:11:53] _joe_: have a look at mine while you wait! :p [14:11:55] https://gerrit.wikimedia.org/r/#/c/356232/3 [14:12:12] <_joe_> okey dokey [14:13:27] oh, wait, that's missing a recent patch [14:13:29] why... 
[14:13:30] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: default retention for prometheus/ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356591 (owner: 10Filippo Giunchedi) [14:13:35] (03PS2) 10Filippo Giunchedi: hieradata: default retention for prometheus/ops eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356591 [14:13:40] ah gerrit was down yesterday when i tried to push it [14:13:41] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166787 [14:13:46] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T166787#3307543 (10ops-monitoring-bot) [14:13:52] (03PS4) 10Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [14:13:56] _joe_: ^ new patch [14:14:11] <_joe_> ottomata: oook [14:15:06] (03CR) 10jerkins-bot: [V: 04-1] Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [14:15:08] (03CR) 10Muehlenhoff: [C: 031] "Looks good. I can merge that tomorrow (and rebuild the gerrit package)." 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) (owner: 10Paladox) [14:15:27] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166746#3307558 (10fgiunchedi) [14:15:29] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166777#3307557 (10fgiunchedi) [14:15:31] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3307553 (10fgiunchedi) [14:15:33] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T166787#3307559 (10fgiunchedi) [14:18:29] !log mforns@tin Started deploy [analytics/refinery@7540403]: (no justification provided) [14:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:26] (03CR) 10Hashar: [C: 031] contint: Only install libmysqlclient-dev if on trusty or jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356246 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [14:20:59] <_joe_> ottomata: looks good so far :) [14:21:20] !log mforns@tin Finished deploy [analytics/refinery@7540403]: (no justification provided) (duration: 02m 50s) [14:21:20] (03CR) 10Volans: "I personally don't like much the new messages and having different messages between the two datacenters is a problem for the switchdc scri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) (owner: 10Jcrespo) [14:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] (03CR) 10Muehlenhoff: "@paladox: The reason this didn't work for gerrit is because the init script of gerrit doesn't include /lib/lsb/init-functions (which jenki" [puppet] - 10https://gerrit.wikimedia.org/r/356560 (owner: 10Hashar) [14:24:22] _joe_: :) cool. isn't it annoying though! 
Even if a module has sane defaults, if I want to override them for one specific role, I have to specify the defaults again for every other role .yaml file :/ annoying! [14:25:05] <_joe_> ottomata: I agree, but the result is cleaner code [14:25:21] <_joe_> and btw, I don't like the hiera call buried inside kafka_config ;) [14:25:23] (03PS2) 10Jcrespo: mariadb: Allow full reimage of db2041,38,37,35,44 (still on trusty) [puppet] - 10https://gerrit.wikimedia.org/r/356387 [14:25:34] (03CR) 10Jcrespo: mariadb: Allow full reimage of db2041,38,37,35,44 (still on trusty) [puppet] - 10https://gerrit.wikimedia.org/r/356387 (owner: 10Jcrespo) [14:26:41] (03PS2) 10Filippo Giunchedi: role: report instance disk full % in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/356592 [14:27:11] haha [14:27:19] not my fault (ok, kinda my fault) [14:27:21] hmm [14:27:47] we could maybe change that _joe_, make kafka_config take the $kafka_clusters [14:27:52] and make it a profile param [14:27:59] $kafka_clusters = hiera('kafka_clusters') [14:28:09] $config = kafka_config($kafka_clusters, $kafka_cluster_name) [14:30:42] (03CR) 10Filippo Giunchedi: [C: 032] role: report instance disk full % in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/356592 (owner: 10Filippo Giunchedi) [14:31:24] (03PS20) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:32:02] (03PS4) 10Filippo Giunchedi: profile: introduce swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) [14:32:51] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3307646 (10BBlack) Yeah that was the plan, for XKey to help here by consolidating that down to a single HTCP /...
[14:34:41] <_joe_> ottomata: yeah, but I'd leave that for later [14:35:11] aye [14:36:50] (03CR) 10Filippo Giunchedi: [C: 032] profile: introduce swift::storage::labs [puppet] - 10https://gerrit.wikimedia.org/r/350389 (https://phabricator.wikimedia.org/T162247) (owner: 10Filippo Giunchedi) [14:39:14] (03CR) 10Filippo Giunchedi: "Old review but still relevant I think in case of multiple celery workers" [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) (owner: 10Filippo Giunchedi) [14:40:43] (03CR) 10Hashar: [C: 04-1] "I have intentionally made zuul-merger to run in foreground and let systemd track the PID. That is way easier to handle that way and make t" [puppet] - 10https://gerrit.wikimedia.org/r/356185 (owner: 10Paladox) [14:41:26] (03CR) 10Paladox: "> I have intentionally made zuul-merger to run in foreground and let" [puppet] - 10https://gerrit.wikimedia.org/r/356185 (owner: 10Paladox) [14:41:30] (03PS8) 10Filippo Giunchedi: base: export puppet agent stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354457 [14:41:32] (03PS8) 10Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/354459 [14:41:34] (03PS8) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 [14:41:36] (03PS7) 10Filippo Giunchedi: role: set external url for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/354975 [14:41:38] (03PS8) 10Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 [14:42:31] (03CR) 10Hashar: [C: 04-1] "Maybe because other services can not run in foreground ? In this case I dont think we need that." [puppet] - 10https://gerrit.wikimedia.org/r/356185 (owner: 10Paladox) [14:42:53] (03CR) 10Bartosz Dziewoński: "I'm fixing the indentation in https://gerrit.wikimedia.org/r/#/c/356478/ ." 
[puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [14:42:59] (03CR) 10Alexandros Kosiaris: [C: 031] celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) (owner: 10Filippo Giunchedi) [14:43:02] (03Abandoned) 10Paladox: Zuul: Update zuul-merger.systemd.erb to run in the background with a pid [puppet] - 10https://gerrit.wikimedia.org/r/356185 (owner: 10Paladox) [14:43:14] (03PS3) 10Bartosz Dziewoński: Fix typo on Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356477 [14:43:16] (03PS2) 10Bartosz Dziewoński: Fix indentation of Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356478 [14:43:24] paladox: the intent was to make the zuul-merger systemd config super trivial :-} [14:43:27] (03PS4) 10Bartosz Dziewoński: Fix typo on Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356477 [14:43:32] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 (owner: 10Filippo Giunchedi) [14:43:39] (03PS3) 10Bartosz Dziewoński: Fix indentation of Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356478 [14:43:40] paladox: and essentially drop the daemonization / custom pid tracking. systemd takes care of it for us automagically :-} [14:43:54] (03PS21) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [14:43:56] (03CR) 10Bartosz Dziewoński: "I can do that too if you really want." [puppet] - 10https://gerrit.wikimedia.org/r/356478 (owner: 10Bartosz Dziewoński) [14:44:00] Oh. thanks for explaining :).
[14:45:42] (03PS4) 10Bartosz Dziewoński: Fix indentation of Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356478 [14:47:49] (03CR) 10Hashar: [C: 031] ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [14:52:43] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3307745 (10Papaul) [14:53:35] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3252410 (10Papaul) a:05Papaul>03RobH @robh This is complete on my end [14:53:58] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3307767 (10Papaul) p:05Triage>03Normal [14:54:29] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3307770 (10daniel) @BBlack I have looked into XKey before, and have been wanting to work on this for a while (... [14:54:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Looks fine overall, some small comments." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [14:57:09] (03PS2) 10Filippo Giunchedi: celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) [15:03:41] !log restart kafka100[23] for jvm upgrades [15:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:01] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6631/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) (owner: 10Filippo Giunchedi) [15:07:13] (03PS1) 10BBlack: LVS: new redundancy layout for new eqiad+ulsfo hosts [puppet] - 10https://gerrit.wikimedia.org/r/356605 (https://phabricator.wikimedia.org/T150256) [15:08:11] (03PS2) 10BBlack: LVS: new redundancy layout for new eqiad+ulsfo hosts [puppet] - 10https://gerrit.wikimedia.org/r/356605 (https://phabricator.wikimedia.org/T150256) [15:11:16] (03PS3) 10BBlack: varnish: add explicit guards around upload-specific VCL [puppet] - 10https://gerrit.wikimedia.org/r/356583 (https://phabricator.wikimedia.org/T164608) (owner: 10Ema) [15:11:58] (03CR) 10BBlack: [V: 032 C: 032] varnish: add explicit guards around upload-specific VCL [puppet] - 10https://gerrit.wikimedia.org/r/356583 (https://phabricator.wikimedia.org/T164608) (owner: 10Ema) [15:18:24] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3307873 (10chasemp) >>! In T166237#3300388, @Papaul wrote: > @chasemp what partman recipe do you want to use for the server? We have : > raid10-gpt-srv-lv... [15:19:31] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3307874 (10jcrespo) According to my records, 43-70 were bought together. 
33-42 are the same exact model, I will check if I can reimage one of those, too. [15:25:06] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3307886 (10Papaul) a:05Papaul>03jcrespo Firmware update complete [15:27:47] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3307895 (10Papaul) @chasemp thanks. [15:28:57] (03PS1) 10Mark Bergsma: Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 [15:28:59] (03PS1) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [15:39:59] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [15:40:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [15:40:18] (03CR) 10Filippo Giunchedi: [C: 032] celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) (owner: 10Filippo Giunchedi) [15:40:23] (03PS3) 10Filippo Giunchedi: celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) [15:40:39] XioNoX: FYI, interface recovered ^^^ [15:41:16] 06Operations, 10ops-ulsfo, 10Traffic, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3307984 (10RobH) [15:43:11] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) (owner: 10Filippo Giunchedi) [15:45:49] (03PS1) 10Papaul: Add partman recipe and DHCP entries for labtestvirt2003 [puppet] - 10https://gerrit.wikimedia.org/r/356617 [15:52:53] 
06Operations, 10Monitoring, 03Interactive-Sprint, 06Maps (Kartotherian), 07Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3308016 (10debt) Hi @Gehel and @Pnorman - is this something that you can fix? [15:53:38] (03CR) 10RobH: [C: 032] Add partman recipe and DHCP entries for labtestvirt2003 [puppet] - 10https://gerrit.wikimedia.org/r/356617 (owner: 10Papaul) [15:53:44] (03PS2) 10RobH: Add partman recipe and DHCP entries for labtestvirt2003 [puppet] - 10https://gerrit.wikimedia.org/r/356617 (owner: 10Papaul) [15:56:37] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308034 (10RobH) [15:56:57] papaul: did you want to kick labtestvirt2003 into the installer or shall i? https://phabricator.wikimedia.org/T166237 [15:57:07] i merged your patchset for the install_module update [15:58:14] (03PS6) 10BBlack: facter: add NUMA information [puppet] - 10https://gerrit.wikimedia.org/r/355809 [15:58:16] (03PS7) 10BBlack: NUMA via facter+hiera for RPS [puppet] - 10https://gerrit.wikimedia.org/r/355810 [15:58:18] (03PS7) 10BBlack: [placeholder] nginx NUMA-networking awareness [puppet] - 10https://gerrit.wikimedia.org/r/355811 [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1600). Please do the needful. 
[16:02:13] robh: if switch port config is done you can, thanks [16:03:13] papaul: ahh, I'll do that and then kick it into the installer, thanks =] [16:03:26] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308046 (10RobH) a:05Papaul>03RobH [16:03:35] robh: ok [16:04:10] (03PS4) 10Mark Bergsma: Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415 [16:04:12] (03PS3) 10Mark Bergsma: Add bgp.ip unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355425 [16:04:15] (03PS3) 10Mark Bergsma: Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 [16:04:17] (03PS2) 10Mark Bergsma: Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [16:04:19] (03PS2) 10Mark Bergsma: Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 [16:04:21] (03PS2) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [16:04:22] (03PS1) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [16:06:58] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308055 (10RobH) [16:07:00] 06Operations, 10ops-codfw, 06cloud-services-team, 10netops: codfw: labtestvirt2002 swith port configuration - https://phabricator.wikimedia.org/T166564#3308052 (10RobH) 05Open>03Resolved a:03RobH done! 
[16:10:21] (03PS22) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [16:18:57] (03PS23) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [16:19:19] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3308073 (10jcrespo) Thank you, taking over for retrying reimage. [16:21:28] (03PS24) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [16:22:33] !log retrying reimage of db2044 [16:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:29] 06Operations, 10Monitoring, 03Interactive-Sprint, 06Maps (Kartotherian), 07Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3308084 (10Gehel) With some digging, either of us should be able to do it. The kartotherian comes bundled with its checks... [16:31:43] I am curious to see if db2044 works after the firmware upgrade... [16:32:11] (03PS1) 10RobH: setting labtestvirt2003 into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/356625 [16:33:58] we'll see, I think it will [16:34:48] ive also been lurking on that task [16:34:54] alternatively, it could be the power drain and we are making papaul overwork :-/ [16:34:58] i hope the firmware update will fix it.... [16:35:34] (03PS1) 10Fdans: Add exception for events tagged as coming from MW [puppet] - 10https://gerrit.wikimedia.org/r/356626 (https://phabricator.wikimedia.org/T67508) [16:38:32] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:39:05] (03CR) 10RobH: [C: 032] setting labtestvirt2003 into site.pp [puppet] - 10https://gerrit.wikimedia.org/r/356625 (owner: 10RobH) [16:42:45] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308139 (10RobH) [16:43:42] (03PS1) 10RobH: Revert "setting labtestvirt2003 into site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/356628 [16:43:48] (03CR) 10RobH: [C: 032] Revert "setting labtestvirt2003 into site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/356628 (owner: 10RobH) [16:44:11] (03CR) 10Jcrespo: [C: 031] "Moritz- I am going to merge the current state because it will allow me to test it (iterating with systemd on an existing package will be e" [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:45:14] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308141 (10RobH) Tried to add into site.pp but seems there is an error condition due to kernels in use (system was just freshly installed.) > Error: Could... [16:52:32] (03CR) 10Jcrespo: [C: 032] dbtools: Update package for stretch and include systemd support [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:52:52] volans: finally! Thanks! [16:53:44] (03Merged) 10jenkins-bot: dbtools: Update package for stretch and include systemd support [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [16:53:57] mutante: Did you set up the cert management crons on wikitech-static? 
[16:54:28] XioNoX: :) [16:55:31] (03CR) 10Jcrespo: "Something is wrong here, some changes got lost in the process, I will fix that on a separate comit." [software] - 10https://gerrit.wikimedia.org/r/356074 (https://phabricator.wikimedia.org/T116903) (owner: 10Jcrespo) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1700). Please do the needful. [17:01:25] no parsoid deploy today [17:02:19] !log sto mysql, eventlogging_sync and shutdown db1047 (analytics-store) for maintenance - T159266 [17:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:41] T159266: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266 [17:08:17] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [17:08:17] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [17:09:21] (03CR) 10Gehel: "@EBernhardson wasn't there an issue if logstash plugins directory contained a directory which isnt a plugin? Could we have an issue with t" (031 comment) [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:09:56] (03CR) 10Gehel: [C: 031] "This looks reasonable, but my understanding of scap3 is minimal." [puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:12:37] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3308265 (10RobH) a:05RobH>03chasemp So it kicked back that failure, and though it now has the puppet keys and such installed, has failures left over from... 
[17:16:02] (03PS1) 10Faidon Liambotis: varnish: don't use $name as a parameter name [puppet] - 10https://gerrit.wikimedia.org/r/356630 [17:16:03] (03PS1) 10Faidon Liambotis: restbase: don't define parameter $hosts twice [puppet] - 10https://gerrit.wikimedia.org/r/356631 [17:16:05] (03PS1) 10Faidon Liambotis: phabricator: don't assign a new hash key [puppet] - 10https://gerrit.wikimedia.org/r/356632 [17:16:08] _joe_: ^ [17:16:18] bblack: ^ too for the first [17:16:44] uh [17:16:46] wait, buggy [17:16:47] 06Operations, 10Wikimedia-Mailing-lists: cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#3308276 (10RobH) a:05RobH>03None [17:17:12] (03PS2) 10Faidon Liambotis: varnish: don't use $name as a parameter name [puppet] - 10https://gerrit.wikimedia.org/r/356630 [17:17:14] (03PS2) 10Faidon Liambotis: restbase: don't define parameter $hosts twice [puppet] - 10https://gerrit.wikimedia.org/r/356631 [17:17:16] ok now [17:17:16] (03PS2) 10Faidon Liambotis: phabricator: don't assign a new hash key [puppet] - 10https://gerrit.wikimedia.org/r/356632 [17:18:17] paravoid: $name gets used a lot within varnish::instance too... [17:19:02] now I'm not sure which version of $name gets used in all those cases heh [17:20:55] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[prometheus-node-exporter] [17:21:09] oh, right [17:21:15] good point [17:21:43] * paravoid wears a brown paper bag [17:22:49] (03CR) 10Giuseppe Lavagetto: [C: 031] restbase: don't define parameter $hosts twice [puppet] - 10https://gerrit.wikimedia.org/r/356631 (owner: 10Faidon Liambotis) [17:22:55] PROBLEM - DPKG on labtestvirt2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:23:06] (03PS3) 10Faidon Liambotis: restbase: don't define parameter $hosts twice [puppet] - 10https://gerrit.wikimedia.org/r/356631 [17:23:08] (03PS3) 10Faidon Liambotis: varnish: don't use $name as a parameter name [puppet] - 10https://gerrit.wikimedia.org/r/356630 [17:23:10] (03PS3) 10Faidon Liambotis: phabricator: don't assign a new hash key [puppet] - 10https://gerrit.wikimedia.org/r/356632 [17:23:11] well I guess we can run the puppet compiler on the cp* nodes [17:24:26] (03CR) 10Giuseppe Lavagetto: [C: 031] phabricator: don't assign a new hash key [puppet] - 10https://gerrit.wikimedia.org/r/356632 (owner: 10Faidon Liambotis) [17:33:11] paravoid: I'm trying now [17:33:20] on one anyways, they're all the same in this respect [17:33:43] oh I was about to [17:33:46] ok, cool :) [17:33:59] fwiw, I did find -type *pp -exec puppet parser validate {} \; [17:34:04] locally, on stretch, with puppet 4.8 [17:34:11] surprisingly, these three changesets were the only errors [17:34:13] https://puppet-compiler.wmflabs.org/6639/cp1065.eqiad.wmnet/change.cp1065.eqiad.wmnet.err [17:34:18] which means puppet parser validate is useless [17:34:23] ^ apparently it has some other consequences [17:34:27] for the puppet 4 migration :) [17:34:51] elukey, transcode error is: An unknown error occurred in storage backend "local-swift-eqiad" [17:36:05] and it failed again after restart [17:36:44] yannf: this smells like the error that godog worked on recently, namely transcodes too big for swift [17:37:24] 
https://commons.wikimedia.org/wiki/File:Bhagam_Bhag_(1956).webm 600 × 480 (655.24 MB) [17:37:32] so probably not [17:37:44] https://quarry.wmflabs.org/query/19158 [17:38:20] bblack: gah [17:38:26] I fucked up again, maybe I should just call it a day [17:38:56] (03PS4) 10Faidon Liambotis: phabricator: don't assign a new hash key [puppet] - 10https://gerrit.wikimedia.org/r/356632 [17:38:56] oh, the else clause [17:38:58] (03PS4) 10Faidon Liambotis: varnish: don't use $name as a parameter name [puppet] - 10https://gerrit.wikimedia.org/r/356630 [17:38:59] yes [17:39:00] it took me a while to find it heh [17:39:19] yeah me too [17:40:03] sorry :/ [17:40:17] really that whole thing is a mess to start with [17:40:23] it can probably be simplified [17:40:28] but some other time :) [17:41:26] (03CR) 10BBlack: [C: 031] "LGTM in https://puppet-compiler.wmflabs.org/6640/cp1065.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/356630 (owner: 10Faidon Liambotis) [17:41:32] !log shutting down wdqs1002 for maintenance - T166524 [17:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:40] T166524: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524 [17:41:59] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1002.eqiad.wmnet [17:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:15] bblack: so shall I merge then? [17:42:57] PROBLEM - Host wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:43:45] paravoid: sure [17:43:52] damn, sorry, wdqs1002 is me, I did not wait enough... 
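[editor's note] The 17:33:59 remark describes sweeping every manifest through `puppet parser validate`. As typed in chat the command uses `-type *pp`, which reads like shorthand for matching `*.pp` files by name. Below is a minimal Python sketch of the same sweep, assuming the `puppet` binary is on PATH and that manifests are identified by the `.pp` suffix. `find_manifests` and `validate_all` are hypothetical helper names for illustration, not part of any existing tooling. The chat itself cautions that a clean parse says little about whether a catalog actually compiles, so this is a first filter at best.

```python
import subprocess
from pathlib import Path


def find_manifests(root):
    """Return every *.pp file under root, sorted for stable output."""
    return sorted(p for p in Path(root).rglob("*.pp") if p.is_file())


def validate_all(root, command=("puppet", "parser", "validate")):
    """Run the validator on each manifest and return the paths that failed.

    `command` is a parameter so the sweep can be dry-run with another
    executable; the default assumes `puppet` is installed and on PATH.
    """
    failed = []
    for manifest in find_manifests(root):
        result = subprocess.run([*command, str(manifest)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(manifest)
    return failed
```

Calling `validate_all(".")` from the top of a puppet checkout would return the list of manifests the parser rejected; an empty list means everything parsed.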
[17:44:43] (03CR) 10Faidon Liambotis: [C: 032] restbase: don't define parameter $hosts twice [puppet] - 10https://gerrit.wikimedia.org/r/356631 (owner: 10Faidon Liambotis) [17:45:09] (03CR) 10Faidon Liambotis: [C: 032] "Noop: https://puppet-compiler.wmflabs.org/6641/" [puppet] - 10https://gerrit.wikimedia.org/r/356632 (owner: 10Faidon Liambotis) [17:45:23] (03CR) 10Faidon Liambotis: [C: 032] varnish: don't use $name as a parameter name [puppet] - 10https://gerrit.wikimedia.org/r/356630 (owner: 10Faidon Liambotis) [17:46:28] elukey, should I open another report? [17:48:17] yannf: this was the issue https://phabricator.wikimedia.org/T166482 [17:49:56] ok, thanks, I will add the recent failures there [17:51:11] 06Operations, 10TimedMediaHandler, 10media-storage: Persistent failure of TMH to transcode videos at specific resolutions - https://phabricator.wikimedia.org/T166482#3297326 (10Yann) More failed transcodes today with the same error: https://commons.wikimedia.org/wiki/File:Bhagam_Bhag_(1956).webm https://comm... 
[17:52:03] (03PS25) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [17:56:23] (03PS26) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [17:58:45] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:45] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:58:45] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:58:45] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:58:55] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:55] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:55] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:55] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:55] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:56] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:58:56] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:59:05] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:09] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:15] RECOVERY - Host wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms [17:59:16] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:59:16] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 
[17:59:16] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:25] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:25] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:59:35] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:35] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [17:59:35] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [17:59:47] sorry that's me, I downtimed cp4021, but stupid ipsec :P [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1800). Please do the needful. 
[18:01:03] really the only logically-correct answer there is split up the ipsec checks so there's N separate checks per host on each host, and then do service->remote_host dependency stuff [18:01:09] but that's a lot of excess checks [18:02:08] (03PS1) 10Andrew Bogott: Labtestvirt2003: Add to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/356641 [18:02:15] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [18:02:15] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [18:02:16] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 72 ESP OK [18:02:25] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK [18:02:25] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [18:02:35] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [18:02:35] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK [18:02:35] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [18:02:45] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [18:02:45] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [18:02:45] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK [18:02:45] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK [18:02:50] (03PS27) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [18:02:55] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [18:02:55] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [18:02:55] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [18:02:56] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [18:02:56] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [18:02:56] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [18:02:56] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [18:03:05] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK [18:03:05] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK [18:03:36] (03CR) 10Andrew 
Bogott: [C: 032] Labtestvirt2003: Add to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/356641 (owner: 10Andrew Bogott) [18:04:23] bblack: we could check if we can set the check on all the hosts to be dependend on the monitored host with icinga dependencies [18:04:45] (03Draft2) 10Paladox: Gerrit: Increase packedGitOpenFiles to 65536 [puppet] - 10https://gerrit.wikimedia.org/r/356586 [18:06:55] RECOVERY - DPKG on labtestvirt2003 is OK: All packages OK [18:08:15] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [18:08:25] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [18:08:45] PROBLEM - Host labtestvirt2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:10] volans: without splitting up the check, that leads to a situation where once one host in a cluster is dead, we've lost ipsec monitoring for the rest of the cluster [18:09:35] RECOVERY - Host labtestvirt2003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:09:37] oh is a single check, right, forgot about that [18:10:22] right now the scenario is something like: if there's 36 hosts in a cluster, there's 72x connections (4v+v6) defined on each host to all the others [18:10:22] labtestvirt2003 is a brand new box, I just downtimed it. Sorry for the noise. 
[18:10:30] and we have one check per host that validates all 72 connections are ok [18:11:02] (it's not quite that simple, we don't do a full mesh, just all edge nodes to core nodes, and core nodes to opposite-site core nodes, but still same basic problem) [18:11:41] if we want host deps to silence them intelligently without suppressing important notifications, we'd have to split it up into one check per connection [18:11:58] which means ~50x+ the total check count for ipsec, which is a lot to add onto our huge list [18:12:56] yeah, unless we could check from icinga that a host is down, but still a lot of hosts to check, probably too heavy and not worth [18:13:16] (03PS28) 10Elukey: role::zookeeper: refactor to multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [18:18:42] (03PS1) 10Faidon Liambotis: puppet: disable stringified facts in Labs as well [puppet] - 10https://gerrit.wikimedia.org/r/356644 [18:20:47] !log wdqs1002 back in LVS - thermal paste added - T166524 [18:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:57] T166524: high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524 [18:20:59] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1002.eqiad.wmnet [18:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:42] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 06Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3308686 (10Gehel) a:05Cmjohnson>03Gehel thermal paste has been added by @Cmjohnson, this can be closed. 
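[editor's note] The check-count arithmetic from the 18:10-18:11 IPsec discussion can be sketched as follows. It takes the simplified figures quoted in chat (36 hosts, 72 connections per host counting v4 and v6) at face value; as pointed out at 18:11:02 the real layout is not a full mesh, so actual numbers differ. The function name is hypothetical.

```python
def ipsec_check_counts(hosts, connections_per_host):
    """Compare one aggregate check per host with one check per connection."""
    aggregate = hosts                      # current layout: one check per host
    split = hosts * connections_per_host   # proposed: one check per connection
    return aggregate, split, split // aggregate


aggregate, split, factor = ipsec_check_counts(hosts=36, connections_per_host=72)
print(aggregate, split, factor)  # 36 2592 72
```

With the 58 and 72 tunnels per host that the alerts above report, the multiplier lands in the same range as the "~50x+" estimate given in chat.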
[18:31:15] probably another small storm of cp4021 ipsec alerts incoming [18:38:15] (03CR) 10Rush: Add initial class for ferm rules shared by all labstore hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [18:40:53] (nope, finished faster this time) [18:41:11] (03PS1) 10Jcrespo: mariadb: Add tokudb support for analytics eventlogging nodes [puppet] - 10https://gerrit.wikimedia.org/r/356648 [18:43:03] (03Draft1) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 [18:43:06] (03PS2) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) [18:50:42] (03PS2) 10Rush: nodepool: lower min-ready for trusty [puppet] - 10https://gerrit.wikimedia.org/r/356466 (owner: 10Hashar) [18:54:58] (03CR) 10VolkerE: [C: 031] dynamicproxy: Centralise error page template and use it [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170601T1900). Please do the needful. 
[19:00:17] * thcipriani does [19:01:00] (03PS1) 10Andrew Bogott: Add hiera file for labtestvirt2003 [puppet] - 10https://gerrit.wikimedia.org/r/356650 [19:01:17] (03CR) 10Thcipriani: [C: 032] Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356589 (https://phabricator.wikimedia.org/T166345) (owner: 10Thcipriani) [19:02:16] (03Merged) 10jenkins-bot: Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356589 (https://phabricator.wikimedia.org/T166345) (owner: 10Thcipriani) [19:02:32] (03CR) 10Andrew Bogott: [C: 032] Add hiera file for labtestvirt2003 [puppet] - 10https://gerrit.wikimedia.org/r/356650 (owner: 10Andrew Bogott) [19:05:26] (03CR) 10jenkins-bot: Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356589 (https://phabricator.wikimedia.org/T166345) (owner: 10Thcipriani) [19:06:21] 06Operations, 10Monitoring, 03Interactive-Sprint, 06Maps (Kartotherian), 07Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3308868 (10debt) a:03Gehel Let's get this test created so that if maps stop working, someone is notified. [19:07:44] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:08:37] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:356589|Revert "Add RejectParserCacheValue handler for mw-parser-output invalidation"]] T166345 (duration: 00m 43s) [19:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:46] T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345 [19:12:42] (03CR) 10Chad: [C: 04-1] "This is not the right place to change it, this is the subject line for /all/ e-mails, which would be spammy and overly-long." [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [19:15:17] (03PS1) 10Andrew Bogott: Labvirt2003: Switch to xfs [puppet] - 10https://gerrit.wikimedia.org/r/356655 [19:16:00] (03CR) 10Chad: [C: 04-1] "This is wayyyyy more than we want. Something around 6000 or so should be plenty--we currently only use about 4800-4900 on average (total, " [puppet] - 10https://gerrit.wikimedia.org/r/356586 (owner: 10Paladox) [19:16:29] (03CR) 10Andrew Bogott: [C: 032] Labvirt2003: Switch to xfs [puppet] - 10https://gerrit.wikimedia.org/r/356655 (owner: 10Andrew Bogott) [19:16:31] (03PS3) 10Paladox: Gerrit: Increase packedGitOpenFiles to 65536 [puppet] - 10https://gerrit.wikimedia.org/r/356586 [19:16:42] (03CR) 10Paladox: "> This is wayyyyy more than we want. Something around 6000 or so" [puppet] - 10https://gerrit.wikimedia.org/r/356586 (owner: 10Paladox) [19:17:03] (03PS9) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) [19:17:48] (03PS3) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) [19:21:06] paladox: Thanks for lowering that. 
Hitting 65k is a sign gerrit's freaking out and doing things it shouldn't :) [19:21:14] We don't want to exhaust our entire quota on the one process! [19:21:15] :) [19:21:21] (03PS1) 10Thcipriani: All wikis to php-1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356656 [19:21:35] RainbowSprinkles Your welcome, thanks for reviewing :). I was thinking that too but didnt really know a number :) [19:22:06] (03CR) 10Thcipriani: [C: 032] All wikis to php-1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356656 (owner: 10Thcipriani) [19:22:15] !log gehel@tin Started deploy [wdqs/wdqs@3936e36]: (no justification provided) [19:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:00] And actually, I think we only need to fix it in the gerrit config. [19:23:28] Eh, harmless to do both [19:23:32] Yeh [19:23:33] (03Merged) 10jenkins-bot: All wikis to php-1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356656 (owner: 10Thcipriani) [19:23:35] !log gehel@tin Finished deploy [wdqs/wdqs@3936e36]: (no justification provided) (duration: 01m 20s) [19:23:41] (03CR) 10jenkins-bot: All wikis to php-1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356656 (owner: 10Thcipriani) [19:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:52] SMalyshev: wdqs fix deployed ^^ [19:23:54] But, it seems systemd does not get affected by shell scripts if we set ulimits in there [19:24:39] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3041135 (10faidon) OK, as far as the problem statement is concerned, I think we've identified three separate problems so far: 1. Documentation (and potentially tooling): `backports` was always meant to be just... 
[19:25:25] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.2 [19:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:58] manually purged enwiki mainpage, no explosions [19:27:00] RainbowSprinkles even if we raise it in that config, systemd does not get affected by it, it seems. But it has special ulimit configs in systemd that allows us to do it :) [19:28:38] (03CR) 10Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:29:02] (03PS5) 10Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [19:35:44] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:37:50] is wmf.3 deployment happening or the train is frozen for now? [19:39:34] teh slave cannot catch teh master, teh Internet iz broken!!11! [19:39:49] well, no more [19:40:09] SMalyshev: we're on wmf.2 now and I think we'll stick there considering the schedule was to have wmf.3 on all wikis today. I think the plan next week will be to move forward with wmf.4 as long as wmf.2 stays not broken. 
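[editor's note] The 19:23-19:27 exchange is about where the Gerrit open-file limit actually takes effect. One way to settle that kind of question on Linux is to read the limit the running process ended up with from `/proc/<pid>/limits`, instead of inferring it from config files. A minimal Python sketch follows, assuming the usual "Max open files" row layout; the function name is hypothetical and the sample text is illustrative, not captured from a real host.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings, exactly as printed by the kernel.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


# Illustrative sample, not captured from a real host.
SAMPLE = """Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""

print(parse_max_open_files(SAMPLE))  # ('1024', '4096')
```

On a live system the input would come from something like `open(f"/proc/{pid}/limits").read()` for the pid of the service in question.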
[19:40:44] thcipriani: got it, thanks
[19:57:26] (PS6) Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162)
[20:14:43] Operations, Analytics, Analytics-Cluster, Traffic, User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3309109 (Ottomata)
[20:18:08] !log bsitzmann@tin Started deploy [mobileapps/deploy@2a8e648]: Update mobileapps to c4dc72d
[20:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:14] (CR) Ottomata: Genericize ca-manager (1 comment) [puppet] - https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: Ottomata)
[20:23:26] !log bsitzmann@tin Finished deploy [mobileapps/deploy@2a8e648]: Update mobileapps to c4dc72d (duration: 05m 18s)
[20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:00] !log mobrovac@tin Started deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986
[20:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:09] T163986: Revamp spec.yaml in citoid - https://phabricator.wikimedia.org/T163986
[20:37:20] !log mobrovac@tin Finished deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986 (duration: 05m 20s)
[20:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:29] T163986: Revamp spec.yaml in citoid - https://phabricator.wikimedia.org/T163986
[20:54:42] Operations, MediaWiki-General-or-Unknown, Performance-Team, Wikimedia-General-or-Unknown, and 2 others: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3309212 (Gilles) Open>Resolved a:Gilles
[21:01:29] (PS1) BryanDavis: shinken, icinga: direct bots to #wikimedia-cloud [puppet] - https://gerrit.wikimedia.org/r/356702 (https://phabricator.wikimedia.org/T166420)
[21:01:31] (PS1) BryanDavis: labs: Direct people to #wikimedia-cloud for support [puppet] - https://gerrit.wikimedia.org/r/356703 (https://phabricator.wikimedia.org/T166420)
[21:07:17] ACKNOWLEDGEMENT - HP RAID on ms-be1020 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166837
[21:07:23] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166837#3309262 (ops-monitoring-bot)
[21:14:18] (CR) Rush: [C: 2] labs: Direct people to #wikimedia-cloud for support [puppet] - https://gerrit.wikimedia.org/r/356703 (https://phabricator.wikimedia.org/T166420) (owner: BryanDavis)
[21:14:25] (CR) Rush: [C: 2] shinken, icinga: direct bots to #wikimedia-cloud [puppet] - https://gerrit.wikimedia.org/r/356702 (https://phabricator.wikimedia.org/T166420) (owner: BryanDavis)
[21:15:46] (PS7) Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162)
[21:33:15] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:33:34] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[21:33:54] PROBLEM - SSH access on gerrit2001 is CRITICAL: connect to address 208.80.153.106 and port 29418: Connection refused
[21:34:24] hmm
[21:36:04] That's me ^
[21:36:14] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[21:36:20] :)
[21:36:34] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[21:36:54] RECOVERY - SSH access on gerrit2001 is OK: SSH OK - GerritCodeReview_2.13.8 (SSHD-CORE-1.2.0) (protocol 2.0)
[21:40:29] (CR) EBernhardson: "a non-plugin directory should be fine now. Previous incarnations that might have been a problem, but now that we build a plugin pack and o" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748) (owner: Thcipriani)
[21:59:01] !log gerrit2001: Upgraded to 2.13.8, seems to be running fine this time.
[21:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:54] PROBLEM - Blazegraph process on wdqs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war
[22:02:24] PROBLEM - Blazegraph Port on wdqs2001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[22:03:24] RECOVERY - Blazegraph Port on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[22:03:54] RECOVERY - Blazegraph process on wdqs2001 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war
[22:09:54] Operations, Wikimedia-General-or-Unknown, I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782#3309408 (Nemo_bis)
[22:15:14] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:16:04] RECOVERY - swift-container-replicator on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[22:24:24] (PS1) Chad: Drop gerrit2001.yaml only includes temp admin permissions [puppet] - https://gerrit.wikimedia.org/r/356765
[22:37:52] Operations, Security-Reviews, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#3309456 (egalvezwmf) Thanks @Aklapper ! Perhaps a few of us can start to explore the software as a potential solution before we do the security review. I may have time to start this next quarter....
[22:39:38] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): high replication lag on wdqs1002 - https://phabricator.wikimedia.org/T166524#3309458 (debt) Open>Resolved
[23:00:06] Operations, ops-codfw, cloud-services-team, Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3309503 (Andrew) Labtestvirt2003 is installed now, and properly attached to rabbitmq and the nova controller. - It is not currently in the scheduler pool (...
[23:05:03] Operations, ops-codfw, cloud-services-team, Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3309507 (Andrew) Open>Resolved
[23:05:22] no bot announcing swat?
[23:05:30] jouncebot: you there?
[23:05:35] jouncebot: next
[23:05:35] In 0 hour(s) and 54 minute(s): Phabricator Upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170602T0000)
[23:07:14] i'm the only one in the window, i'll just ship it
[23:07:20] (CR) EBernhardson: [C: 2] [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: DCausse)
[23:13:23] ebernhardson: let me know when you're all done then? I'll do the phabricator upgrade as soon as everything is clear from SWAT
[23:13:27] (PS3) EBernhardson: [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: DCausse)
[23:13:37] (CR) EBernhardson: [C: 2] [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: DCausse)
[23:14:18] twentyafterfour: sure, should be quick it's just a config change
[23:14:32] (Merged) jenkins-bot: [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: DCausse)
[23:14:44] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:15:34] (CR) Andrew Bogott: [C: 1] "I will plan to merge this tomorrow, and then watch a set of canaries to make sure this doesn't break anything." [puppet] - https://gerrit.wikimedia.org/r/356644 (owner: Faidon Liambotis)
[23:15:34] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 73771 bytes in 0.329 second response time
[23:15:42] (CR) jenkins-bot: [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) (owner: DCausse)
[23:18:21] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T163463: apply sister search restrictions requested by enwiki (duration: 00m 40s)
[23:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:32] T163463: suppress results from multimedia, wikiversity and wikinews; wikivoyage title search only - https://phabricator.wikimedia.org/T163463
[23:20:14] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:14] PROBLEM - Nginx local proxy to apache on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:15] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: T163463: apply sister search restrictions requested by enwiki (duration: 00m 39s)
[23:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:36] twentyafterfour: all done
[23:21:14] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.166 second response time
[23:21:14] RECOVERY - Nginx local proxy to apache on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 5.373 second response time
[23:24:41] thanks!
[23:25:42] !log Preparing phabricator update to tag release/2017-06-01/1 [ https://phabricator.wikimedia.org/project/view/2802/ ]
[23:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:06] (PS2) Faidon Liambotis: Remove str2bool from is_virtual facts [puppet] - https://gerrit.wikimedia.org/r/356031 (https://phabricator.wikimedia.org/T166372)
[23:29:08] (PS2) Faidon Liambotis: raid: switch from stringified fact to array [puppet] - https://gerrit.wikimedia.org/r/356030 (https://phabricator.wikimedia.org/T166372)
[23:29:10] (PS2) Faidon Liambotis: Remove to_i/Integer from now unstringified facts [puppet] - https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372)
[23:29:10] !log Performing phabricator update, expect momentary downtime.
[23:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:08] !log phabricator upgrade complete.
[23:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:06] woot. T116515 is resolved. :)
[23:35:06] T116515: Enable embedding of media from Wikimedia Commons - https://phabricator.wikimedia.org/T116515
[23:38:05] yay!
[23:38:14] worth a phame post? :)
[23:53:21] greg-g: https://phabricator.wikimedia.org/phame/post/view/18/new_feature_embed_videos_from_commons_into_phabricator_markup/
[23:55:13] twentyafterfour: <3