[00:11:46] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1876813 (10matmarex) Thank you, it all works fine. >>! In T119404#1829716, @matmarex wrote: > Full name: Bartosz Dziewoński > Wikitech page: ht... [00:15:30] (03CR) 10Alex Monk: "Is there a bug open with the error?" [puppet] - 10https://gerrit.wikimedia.org/r/258713 (owner: 10Ori.livneh) [00:40:59] (03PS3) 10Yuvipanda: k8s: Use official docker packages instead of jessie's [puppet] - 10https://gerrit.wikimedia.org/r/258733 [00:41:06] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Use official docker packages instead of jessie's [puppet] - 10https://gerrit.wikimedia.org/r/258733 (owner: 10Yuvipanda) [00:45:29] (03CR) 10Reedy: "Nope.. I was just bumping it periodically previously, and hadn't done it for a while...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255822 (owner: 10Reedy) [01:01:43] (03PS5) 10Madhuvishy: [WIP] apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 [01:22:31] (03PS6) 10Madhuvishy: apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 [01:34:21] (03CR) 10TTO: "This is working well on Labs, with one exception. I'll try to get a production patch together for SWAT soon, before the summer shutdown." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [01:52:12] (03PS7) 10Madhuvishy: apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 [01:58:03] (03PS8) 10Yuvipanda: apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 (https://phabricator.wikimedia.org/T120891) (owner: 10Madhuvishy) [01:58:10] (03PS9) 10Yuvipanda: apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 (https://phabricator.wikimedia.org/T120891) (owner: 10Madhuvishy) [01:59:15] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks for the patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/258096 (https://phabricator.wikimedia.org/T120891) (owner: 10Madhuvishy) [02:01:30] (03Abandoned) 10Bmansurov: Enable RelatedArticles and Cards on Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [02:02:04] (03Abandoned) 10Bmansurov: Use CirrusSearch API in RelatedArticles on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252022 (https://phabricator.wikimedia.org/T116707) (owner: 10Bmansurov) [02:03:55] 7Puppet, 6Analytics-Kanban, 5Patch-For-Review: Puppet support for multiple Dashiki instances running on one server - https://phabricator.wikimedia.org/T120891#1876914 (10madhuvishy) [02:04:18] 7Puppet, 6Analytics-Kanban, 5Patch-For-Review: Puppet support for multiple Dashiki instances running on one server - https://phabricator.wikimedia.org/T120891#1876916 (10madhuvishy) a:5yuvipanda>3madhuvishy [02:22:42] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 09m 00s) [02:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Dec 14 02:29:39 UTC 2015 (duration 6m 57s) [02:29:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:50:38] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [03:16:18] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [03:29:59] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 1 failures [03:55:38] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:17:49] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 11.54% of data above the critical threshold [100000000.0] [04:35:17] 7Puppet, 6Analytics-Kanban, 5Patch-For-Review: Puppet support for multiple Dashiki instances running on one server - https://phabricator.wikimedia.org/T120891#1876982 (10yuvipanda) Gerrit Bot says uploaded by me, but was actually from @madhuvishy [04:47:28] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:57:48] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: puppet fail [05:10:49] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures [05:23:29] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:36:27] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:05:37] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: / 1062 MB (3% inode=95%) [06:27:07] RECOVERY - Disk space on elastic1008 is OK: DISK OK [06:28:08] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail [06:30:49] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:08] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:09] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:48] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:49] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:43:59] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [06:55:08] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:55:48] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:38] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:56:38] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:18] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures [06:57:18] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:19] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:09:38] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:07:01] (03PS2) 10Giuseppe Lavagetto: gdash: completely decommission all req* dashboards [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) [08:07:41] (03CR) 10Giuseppe Lavagetto: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [08:08:04] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] gdash: completely decommission all req* dashboards [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [08:12:21] (03CR) 10Chmarkine: [C: 031] ssl_ciphersuite: prefer SHA-2 HMACs more-strongly [puppet] - 10https://gerrit.wikimedia.org/r/258763 (owner: 10BBlack) [08:16:52] (03PS1) 10Giuseppe Lavagetto: gdash: fix typos... [puppet] - 10https://gerrit.wikimedia.org/r/258935 [08:17:19] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] gdash: fix typos... [puppet] - 10https://gerrit.wikimedia.org/r/258935 (owner: 10Giuseppe Lavagetto) [08:30:13] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [08:31:38] (03CR) 10Giuseppe Lavagetto: "@bblack the problem is - the compiler verifies the changes you'd get when/if you are going to merge the change now, so the proper test is " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [08:32:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Lacks error management." (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [08:50:09] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [08:52:09] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:29:22] (03PS1) 10TTO: Enable cluster-wide import setup in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) [09:29:47] (03CR) 10jenkins-bot: [V: 04-1] Enable cluster-wide import setup in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [09:30:06] (03CR) 10Chmarkine: [C: 031] "In addition to this change, I also suggest to prefer ECDHE-ECDSA-AES256 to ECDHE-RSA-AES128 for both strong and mid lists."
[puppet] - 10https://gerrit.wikimedia.org/r/258762 (owner: 10BBlack) [09:33:09] (03PS2) 10TTO: Enable cluster-wide import setup in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) [09:36:48] (03CR) 10TTO: [C: 04-1] "There are yet more problems with this :( :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [09:44:30] mobrovac: FYI I'm going to reimage 1007 shortly [09:44:41] kk thnx godog [09:44:58] no deploy today, so all good [09:45:28] PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 1 failures [09:45:33] godog: just let me know when the machine is back up and imaged [09:46:24] mobrovac: kk, I will! [09:46:29] !log reimage restbase1007 [09:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:13] 6operations, 10MediaWiki-General-or-Unknown, 7Graphite, 5MW-1.27-release-notes, and 2 others: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1877170 (10fgiunchedi) the train has hit all wikis now, still seeing some jobqueue related traffic not in batches but loo... [09:59:06] (03CR) 10Mobrovac: "I tested this in beta:" [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [10:01:03] (03PS1) 10Filippo Giunchedi: diamond: stop referencing to python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/258952 [10:03:23] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:03:33] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [10:03:38] (03CR) 10Filippo Giunchedi: [C: 04-1] "I guess this would need adaptation after https://gerrit.wikimedia.org/r/#/c/258657/ ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [10:07:18] (03CR) 10Filippo Giunchedi: [C: 031] import debian directory [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257168 (owner: 10Ori.livneh) [10:11:27] !log elastic in eqiad: freezing writes (to restart elastic1016) [10:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:40] !log restarting mysql at labsdb1004 [10:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:13:15] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:19:09] !log elastic in eqiad: restarting elastic1016 (to release deleted filedesc) [10:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So, Faidon already merged something similar in" [puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [10:24:24] (03PS18) 10Giuseppe Lavagetto: etcd: auth puppetization [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [10:25:42] (03PS1) 10Jcrespo: Reconfigure and upgrade labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/258959 [10:27:37] (03CR) 10Jcrespo: [C: 032] Reconfigure and upgrade labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/258959 (owner: 10Jcrespo) [10:29:39] (03PS2) 10Filippo Giunchedi: diamond: stop referencing to python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/258952 [10:29:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: stop referencing to python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/258952 (owner: 10Filippo Giunchedi) [10:30:53] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail [10:35:29] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1877228 (10Chmarkine) @BBlack I suggest to remove at least `VeriSignClass3_G2` and `VeriSignClass1` from our trust list. According to [1], `Class3_G2` is a 1024 bit root, and `... [10:35:57] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:37:50] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1007-a instance [puppet] - 10https://gerrit.wikimedia.org/r/258962 [10:39:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1007-a instance [puppet] - 10https://gerrit.wikimedia.org/r/258962 (owner: 10Filippo Giunchedi) [10:50:08] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:18] !log performing schema change on wikishared.cx_translations (x1-master) [10:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:55:23] santhosh ^ [10:56:07] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [10:58:26] jynus: could you give your blessing on https://phabricator.wikimedia.org/T121335 ? [10:59:38] tgr, I am not even CCd on that ticket :-) [11:00:56] PROBLEM - salt-minion processes on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:01:26] PROBLEM - Check size of conntrack table on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:01:27] PROBLEM - dhclient process on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:03:24] PROBLEM - DPKG on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:03:42] PROBLEM - Disk space on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:03:44] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.32.153:10042/: Timeout on connection while downloading http://10.64.32.153:10042/ [11:03:54] PROBLEM - RAID on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:04:32] PROBLEM - configured eth on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:04:44] PROBLEM - puppet last run on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:05:33] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [11:06:01] (03CR) 10Hashar: [C: 031] contint: rename slave-scripts class [puppet] - 10https://gerrit.wikimedia.org/r/258057 (owner: 10Dzahn) [11:06:44] jynus: you are now [11:07:47] No, sorry, that is not acceptable: "This should go out before Tuesday as I forgot a schema change is involved and merged the patch." [11:07:59] it will have to wait [11:08:31] the table has <1000 rows [11:08:32] I ask for 2 weeks of advance notice in the ticket description [11:08:47] https://phabricator.wikimedia.org/project/profile/1494/ [11:09:44] follow the procedure and I can promise to do it as fast as possible, but I cannot guarantee it will be done by tomorrow [11:09:57] there are people that have scheduled my time already [11:10:28] it could be done, maybe, on Wednesday [11:12:16] mobrovac: 1007 is up btw, bootstrapping [11:13:25] !log elastic in eqiad: resuming writes [11:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:20] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1877278 (10Remware) Do we have a jessie package in stable already? [11:16:04] jynus: if this is the current workflow, and not just a proposal, can it be marked as such on https://wikitech.wikimedia.org/wiki/Schema_changes ? [11:16:18] yes, I am working on it [11:18:23] argh, that is broken, I cannot move over a redirect [11:18:40] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1877284 (10MoritzMuehlenhoff) @remware: If by "stable" you mean standard Debian jessie, then no. The latest version of openssl 1.0.2 is available in jessie-wikimedia, though. [11:20:45] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1877285 (10BBlack) I'm not sure what the context of your question is, but yes, we have been running an openssl-1.0.2 package on our jessie hosts from our jessie-wikimedia repo since back when this was resolved... [11:21:31] heh, thanks for letting me know my comment was redundant 2 minutes before I hit Submit, phab :P [11:22:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [11:23:43] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:24:48] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1877286 (10Remware) My question was related to standard Debian, thanks for the answer; I hope you are pushing upstream at some point...
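(Context for the T104143 thread above: jessie hosts pick up the newer openssl from the internal jessie-wikimedia apt component rather than from Debian proper. A minimal sketch of what that looks like on a host -- the repo line and package names are illustrative, not taken from this log:)

    # /etc/apt/sources.list.d/wikimedia.list would carry something like:
    #   deb http://apt.wikimedia.org/wikimedia jessie-wikimedia main backports thirdparty
    apt-get update
    apt-cache policy openssl          # should show the jessie-wikimedia 1.0.2 build as candidate
    apt-get install openssl libssl1.0.0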
[11:25:33] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [11:27:09] bblack: BTW, the chacha20/poly1305 implementation (which was initially patched into the custom package) has been merged into openssl master a few days ago [11:28:46] (03PS2) 10BBlack: ssl_ciphersuite: prefer ECDHE to DHE within strong list [puppet] - 10https://gerrit.wikimedia.org/r/258762 [11:29:06] 6operations, 10Traffic: openssl 1.0.2 packaging for jessie - https://phabricator.wikimedia.org/T104143#1877313 (10MoritzMuehlenhoff) >>! In T104143#1877286, @Remware wrote: > My question was related to standard Debian, thanks for the answer; I hope you are pushing upstream at some point... jessie will stick... [11:29:19] (03CR) 10BBlack: [C: 032 V: 032] ssl_ciphersuite: prefer ECDHE to DHE within strong list [puppet] - 10https://gerrit.wikimedia.org/r/258762 (owner: 10BBlack) [11:29:20] (03PS2) 10BBlack: ssl_ciphersuite: prefer SHA-2 HMACs more-strongly [puppet] - 10https://gerrit.wikimedia.org/r/258763 [11:29:21] (03CR) 10BBlack: [C: 032 V: 032] ssl_ciphersuite: prefer SHA-2 HMACs more-strongly [puppet] - 10https://gerrit.wikimedia.org/r/258763 (owner: 10BBlack) [11:29:47] moritzm: nice! they're probably targeting 1.1 though [11:31:40] !log upgrade diamond to 3.5-5 in eqiad [11:31:44] yeah, that's all for 1.1 [11:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:53] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:33:39] moritzm: hopefully 1.1 will include ECDHE-ECDSA-CAMELLIA128-GCM-SHA256 and similar; then we'll finally have a few options instead of this bottleneck with only AES-GCM for "good" ciphers. [11:33:47] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] Filippo Giunchedi restbase1007 bootstrapping [11:34:54] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [11:35:04] looks like the IETF standard for chacha20/poly1305 will get a last call on Dec 17th now for the latest rev [11:36:08] https://www.openssl.org/policies/releasestrat.html interesting too
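(A quick way to sanity-check what a ciphersuite preference string like the ones in 258762/258763 expands to is `openssl ciphers`. A minimal sketch -- the cipher string below is illustrative, not the exact one merged:)

    # List the suites an ECDHE-first, SHA-2-first string actually selects,
    # in preference order (openssl 1.0.2+ for the full suite list):
    openssl ciphers -v 'ECDHE+AESGCM:DHE+AESGCM:ECDHE+SHA256:!aNULL:!MD5'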
[11:38:13] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: puppet fail [11:38:43] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:44] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [11:39:42] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: puppet fail [11:39:53] PROBLEM - puppet last run on elastic1007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:32] PROBLEM - puppet last run on mw1077 is CRITICAL: CRITICAL: Puppet has 1 failures [11:40:53] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:53] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:03] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:13] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:19] is this your change or mine? [11:42:32] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [11:42:43] probably mine, but I don't think it's anything to worry about [11:43:24] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 2 failures [11:43:41] I think it's that the palladium + strontium apache2 servers used for puppet itself are auto-restarted on config change, so the ciphersuite change caused a spike of "500 Internal Server Error" on the puppetmasters for active clients (including itself in the case of strontium) [11:43:53] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [11:44:01] ah [11:44:19] <_joe_> just puppetmasters acting up [11:44:23] RECOVERY - puppet last run on mw1077 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:44:24] makes sense, and it is confirmed in the logs that it is a server issue [11:44:43] "issue" [11:44:56] <_joe_> "fart" [11:45:03] Notice: /Stage[main]/Apache/Service[apache2]: Triggered 'refresh' from 1 events [11:45:07] Error: /Stage[main]/Puppetmaster::Gitpuppet/Ssh::Userkey[gitpuppet]/File[/etc/ssh/userkeys/gitpuppet]: Could not evaluate: Error 500 on SERVER: [11:45:23] in the strontium case, it failed on a 500 right after it refreshed its own apache2 service, so I tend to think that's what caused it for all [11:46:33] (03PS2) 10BBlack: cache_text/mobile: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) [11:46:35] (03PS2) 10BBlack: cache_upload: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258464 (https://phabricator.wikimedia.org/T96847) [11:46:37] (03PS2) 10BBlack: cache_misc: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258463 (https://phabricator.wikimedia.org/T96847) [11:46:39] (03PS3) 10BBlack: VCL: no special handling for CentralAutoLogin [puppet] - 10https://gerrit.wikimedia.org/r/258207 (https://phabricator.wikimedia.org/T96847) [11:46:43] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:44] PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:53:03] (03CR) 10BBlack: [C: 032] cache_misc: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258463 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [11:53:54] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/complete/: Timeout on connection while downloading http://10.64.32.153:10042/complete/ [11:54:35] euh? [11:54:40] * mobrovac looking at ^ [11:55:44] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [11:55:54] (03PS1) 10Aude: Set formatterUrlProperty setting for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) [11:56:32] of course varnish+confd has our tiers/layers hardcoded in it :P [11:57:41] <_joe_> bblack: :/ [11:58:23] _joe_: working on it! :) [11:58:42] this is all going to get even trickier down the road though [11:58:48] <_joe_> if you need any help with the demented go text/template, just tell me [11:59:08] <_joe_> bblack: ok so, can we make vcl read a json file or something like that?
[11:59:21] doubt it [11:59:47] the issue right now is that the directors template is per-instance, and an instance has multiple backends that are dynamic, and they don't share attributes necessarily, etc... [12:00:02] we kinda made it work, but we hardcoded a lot of knowledge about how things currently work with backends and such [12:00:08] 6operations, 6Performance-Team, 10Thumbor, 5Patch-For-Review: Use cgroups to limit thumbor & subprocesses resource usage - https://phabricator.wikimedia.org/T120940#1877392 (10Gilles) [12:00:15] <_joe_> iup [12:00:17] <_joe_> *y [12:00:36] my puppet-level change was to leave the frontend's "backend" alone (chash to local backends) but to change "backend_random" to randomize to tier-1-dc backends rather than local dc [12:00:51] but the templating just has "keyspace" for both that's created with the local dcname, etc... [12:01:49] eventually we'll want it all to be switchable at that level anyways, so it's eventually going to get a lot more complicated [12:02:37] e.g. with codfw+eqiad as dual tier-one, we'd have ulsfo using codfw as its tier1 and esams using eqiad as its tier1, with the ability to switch that mapping as necc for downing a DC or whatever [12:02:58] godog: i'll deploy rb on rb1007 [12:03:02] either that or have them both using both active/active and still be able to remove one of the two [12:03:29] and then all layers/tiers knowing the tier-one applayer backend endpoints as well... [12:03:31] mobrovac: ack, I'm going to lunch now, will repool after you are done and I'm back [12:03:42] kk, buon appetito [12:03:51] grazie! ttyl [12:04:03] RECOVERY - puppet last run on elastic1007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:04:14] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:04:42] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:04:43] !log restbase deploy on rb1007 after re-imaging [12:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:04:54] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:05:03] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:19] (03CR) 10Aude: "the setting is not used yet but we can still go ahead and set it now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude) [12:05:43] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:52] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:03] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:13] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:23] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:23] godog: done, all good [12:08:12] PROBLEM - NTP on cygnus is CRITICAL: NTP CRITICAL: No response from NTP server [12:11:09] akosiaris: kart_: around? [12:11:40] should we try out the shiny new cxserver in beta? 
[12:12:14] PROBLEM - SSH on cygnus is CRITICAL: Server answer [12:12:53] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:14:24] mobrovac: I am going to lunch but when I am back, sure [12:14:37] k, bon ap [12:14:48] how do you say it in greek? [12:14:59] (in latin alphabet, please) [12:15:01] :P [12:20:13] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:21:03] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Puppet has 1 failures [12:21:06] (03PS1) 10BBlack: varnish dynamic directors: override dc per-service, use on misc [puppet] - 10https://gerrit.wikimedia.org/r/258970 [12:23:28] (03PS2) 10BBlack: varnish dynamic directors: override dc per-service, use on misc [puppet] - 10https://gerrit.wikimedia.org/r/258970 [12:27:24] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 106127 MB (3% inode=99%) [12:28:12] (03CR) 10BBlack: [C: 032 V: 032] "Compiler checks out ok, will test clusters with puppet disabled..." [puppet] - 10https://gerrit.wikimedia.org/r/258970 (owner: 10BBlack) [12:31:06] damn it rb1004 [12:32:07] godog: back from lunch perhaps? [12:39:10] ping me when you are [12:44:24] PROBLEM - SSH on cygnus is CRITICAL: Server answer [12:47:11] (03PS3) 10BBlack: cache_text/mobile: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) [12:47:11] (03PS3) 10BBlack: cache_upload: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258464 (https://phabricator.wikimedia.org/T96847) [12:47:11] (03PS4) 10BBlack: VCL: no special handling for CentralAutoLogin [puppet] - 10https://gerrit.wikimedia.org/r/258207 (https://phabricator.wikimedia.org/T96847) [12:47:14] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:48:04] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [12:48:25] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [12:52:38] (03CR) 10BBlack: [C: 032] cache_upload: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258464 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [12:56:31] (03PS4) 10Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) [12:56:53] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [13:00:13] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:05:53] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:09:44] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [13:12:34] PROBLEM - SSH on cygnus is CRITICAL: Server answer [13:14:34] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [13:22:10] !log repool restbase1007 [13:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:18] mobrovac: yup [13:23:13] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:23:48] (03CR) 10Alex Monk: Checkout and then rebase instead of cherry-pick (031 comment) [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [13:24:41] godog: rb1004 running out of disk space [13:24:48] godog: is rb1007 streaming from it? [13:24:53] if not, i'd restart it [13:27:11] mobrovac: no it isn't [13:27:50] godog: ack, r u ok with me restarting it? [13:28:24] mobrovac: sure go for it [13:29:03] !log restbase cassandra restarting cassandra on rb1004 due to low disk space caused by compactions [13:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:53] PROBLEM - Varnishkafka log producer on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [13:30:30] !log issues on cp4005, investigating [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:58] !log rebooting cp4005 [13:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:03] RECOVERY - Disk space on restbase1004 is OK: DISK OK [13:32:12] PROBLEM - puppet last run on cp4005 is CRITICAL: Timeout while attempting connection [13:33:03] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:13] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:53] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 80.06 ms [13:34:54] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [13:35:32] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 489 bytes in 0.172 second response time [13:36:02] RECOVERY - Varnishkafka log producer on cp4005 is OK: PROCS OK: 1 process with command name varnishkafka [13:48:07] godog: i tested https://gerrit.wikimedia.org/r/#/c/257898/ in labs, and it's looking good [13:51:23] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:52:25] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [13:52:54] 6operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7Availability, and 2 others: puppet keep trying to restart redis because upstart track wrong PID - https://phabricator.wikimedia.org/T121396#1877607 (10hashar) 3NEW a:3yuvipanda [13:53:14] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [13:55:32] (03PS1) 10Hashar: redis: upstart should track PID after one fork [puppet] - 10https://gerrit.wikimedia.org/r/258972 (https://phabricator.wikimedia.org/T121396) [13:55:46] 6operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7Availability, and 3 others: puppet keep trying to restart redis because upstart track wrong PID - https://phabricator.wikimedia.org/T121396#1877620 (10hashar) a:5yuvipanda>3hashar [13:56:04] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [13:56:08] mobrovac: yup, looks good [13:56:15] (03CR) 10Hashar: "Causes T121396, fixed by https://gerrit.wikimedia.org/r/258972" [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704) (owner: 10Yuvipanda) [13:56:24] godog: cool thnx [13:56:29] godog: when can we get it out? [13:56:30] :P [13:57:33] mobrovac: not sure tbh [14:00:58] godog: k, we can discuss it in today's deployment cabal meeting [14:01:05] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on CI and beta cluster puppet master. I had to kill redis-server then start the instance and after that upstart works as exp" [puppet] - 10https://gerrit.wikimedia.org/r/258972 (https://phabricator.wikimedia.org/T121396) (owner: 10Hashar)
[14:09:44] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [5000000.0] [14:11:23] PROBLEM - SSH on cygnus is CRITICAL: Server answer [14:11:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [5000000.0] [14:11:44] PROBLEM - Disk space on scandium is CRITICAL: DISK CRITICAL - /srv/ssd is not accessible: No such file or directory [14:13:33] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [14:14:47] scandium apparently magically lost one of its SSDs .. [14:15:11] (03CR) 10Addshore: [C: 031] Set formatterUrlProperty setting for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude) [14:17:13] 6operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7Availability, and 3 others: puppet keep trying to restart redis because upstart track wrong PID - https://phabricator.wikimedia.org/T121396#1877637 (10hashar) a:5hashar>3None Fixed on Beta cluster / CI. Not sure of h... [14:17:21] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1877639 (10faidon) 5Open>3declined a:3faidon [14:17:46] (03CR) 10BBlack: "Rebases or merges, even if automatically-resolved, could still lead to unexpected visible (in code terms) or invisible diffs in functional" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [14:18:12] tgr|away, https://wikitech.wikimedia.org/wiki/Schema_changes [14:19:23] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:21:35] RoanKattouw_away, with your permission, I would like to rewrite https://wikitech.wikimedia.org/wiki/How_to_do_a_schema_change [14:23:45] (03CR) 10BBlack: "(but for the record: if all gerrit changesets were isolated, I could see the argument for rebase/cherrypick (which would be identical anyw" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [14:24:28] <_joe_> bblack: we're circling back to why I decided to use simple cherry-picks [14:27:40] 6operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7Availability, and 3 others: puppet keep trying to restart redis because upstart track wrong PID - https://phabricator.wikimedia.org/T121396#1877649 (10scfc) Actually, @Joe's 16794d5a56bae609d7a6fe85382cfe5e475063cb //remo... [14:27:55] (03CR) 10Hashar: "Need to inject --key-prefix as well :(" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [14:28:22] _joe_: yeah but they totally don't work for a series of dependent patches
[14:28:59] e.g. patch #3 may rely on a puppet variable set in patch #1, and it will just fail puppetcompile completely when cherry-picked [14:29:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:29:32] PROBLEM - SSH on cygnus is CRITICAL: Server answer [14:30:37] <_joe_> bblack: I know, I understand it's an issue [14:30:48] when I compile-test #3, I want that to include the changes in #1. If you test the actual sha1, that always happens reliably. if you cherry-pick you lose patch #1. if you rebase/merge, it's unpredictable (to me) whether that's a valid test or if there was something else new in production that affected my test results in any way (assuming it even was able to resolve automatically) [14:31:09] <_joe_> bblack: another possibility could be to rewind production to the base of the current changeset [14:31:18] <_joe_> that might work [14:31:38] <_joe_> and we add a big warning sign in the output if the patch is not rebased properly [14:32:02] (03CR) 10Dereckson: [C: 04-1] "urwiki / ukwiki confusion" (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258453 (https://phabricator.wikimedia.org/T120348) (owner: 10Luke081515) [14:32:02] well, probably the way to "rewind" is to first calculate the unmerged base commit via something like "git log production..$TO_BE_TESTED" and see the first change there [14:32:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:32:12] <_joe_> yep [14:32:36] <_joe_> that would break in edge cases, but honestly, I don't care much about such edge cases [14:32:45] <_joe_> it's always going to be a tradeoff [14:33:02] <_joe_> we should be very careful with... submodules! [14:33:04] still, it's kind of automating a step that humans should be doing, though [14:33:13] * _joe_ stares ottomata [14:33:18] <_joe_> bblack: heh, yes [14:33:38] <_joe_> the other option is to fail to compile on any patch that is not properly rebased [14:33:43] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:33:56] whoa, walk into a room and get stared at [14:34:04] RECOVERY - Disk space on scandium is OK: DISK OK [14:34:11] <_joe_> from a grumpy opsen, no less [14:34:25] well, on any compile whose whole history is not properly rebased - it should still be OK for patch 1 to be rebased properly and test patch 3 [14:34:46] 6operations, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7Availability, and 3 others: puppet keep trying to restart redis because upstart track wrong PID - https://phabricator.wikimedia.org/T121396#1877662 (10hashar) Oh, good find @scfc , right now the instances do have `daemo... [14:34:56] <_joe_> bblack: btw, what do you think of the functionality I added to conftool? https://gerrit.wikimedia.org/r/#/c/258428/ [14:35:02] the problem with enforcing rebase is the time lag and racing when we're all busy on independent things that we can see, as humans, don't affect each other. [14:35:31] we had the same issue on CI [14:36:07] what Zuul does under the hood is merge the patchset onto the tip of the branch. To prevent races between repos, all changes are enqueued and merged on top of each other [14:36:18] but for puppet.git that is not going to work since Jenkins can't merge there [14:36:36] (and the compiler is not part of a +2 -> jenkins merge) [14:37:09] maybe it can be made an option in the compiler run?
rebase = true|false [14:37:10] _joe_: I like --find :) [14:38:10] hashar: (and if rebase==true, rebase from the earliest divergence, not just rebase the singular patch in isolation like a cherrypick) [14:38:33] OOoO gdash gone? [14:38:39] gdash req dashes? [14:38:45] bblack: yeah or just attempt a merge [14:38:49] <_joe_> ottomata: is that a problem? [14:38:59] <_joe_> you +1'd the change [14:39:01] <_joe_> :P [14:39:27] <_joe_> ottomata: that data ("pageviews") was a wildly inaccurate set of data coming from udp2log [14:39:44] no not a problem [14:39:47] <_joe_> if we want to reproduce that 1:1, we can modify the collection scripts [14:39:53] an exciting development [14:40:02] that is an excited OOoOoO [14:40:09] <_joe_> oh, eheh [14:40:11] <_joe_> ok [14:40:24] that means i can turn off some udp2log stuff! :D [14:40:35] <_joe_> yes you probably can [14:40:40] ottomata: while you're here, did you guys end up talking about the webrequest_(mobile|text) thing? [14:40:51] is it a big issue? [14:41:55] yeah we talked about it, there are a few jobs that will have to be modified, but they are mainly legacy tsv jobs (ones that replaced some udp2log outputs) [14:41:56] (03PS1) 10Giuseppe Lavagetto: conftool: add support for ACLs, helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/258975 [14:42:09] i sent an email to the public analytics list asking if we could just stop generating those files all together [14:42:12] (03CR) 10jenkins-bot: [V: 04-1] conftool: add support for ACLs, helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/258975 (owner: 10Giuseppe Lavagetto) [14:42:16] (still checking morning email) [14:42:19] <_joe_> meh [14:42:50] <_joe_> hashar: ^^ (jenkins fail) [14:43:21] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [14:43:32] (03PS19) 10Giuseppe Lavagetto: etcd: auth puppetization [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [14:43:34] (03PS2) 10Giuseppe Lavagetto: conftool: add support for ACLs, helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/258975 [14:43:34] _joe_: parent patch https://gerrit.wikimedia.org/r/#/c/255155/ probably conflicts with tip of branch [14:43:37] <_joe_> err, I'm rebasing [14:43:43] <_joe_> hashar: doubt it [14:43:58] at least Gerrit had Can Merge: No [14:44:10] (03CR) 10jenkins-bot: [V: 04-1] conftool: add support for ACLs, helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/258975 (owner: 10Giuseppe Lavagetto) [14:45:30] <_joe_> hashar: this is pretty confusing tbh [14:45:43] <_joe_> I just rebased on top of gerrit/production [14:46:19] ah [14:47:00] <_joe_> anyways, not an issue - i just need to cherry-pick it in beta :) [14:47:11] !log Stopping zuul-merger daemon on scandium. 
It lost its disk somehow earlier: "DISK CRITICAL - /srv/ssd is not accessible: No such file or directory" [14:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:53] PROBLEM - SSH on cygnus is CRITICAL: Server answer [14:47:58] _joe_: zuul attempts to merge the patch on the tip of the branch; that is done by a process "zuul-merger", and the one on scandium is misbehaving because the disk went missing :( [14:48:09] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/258975 (owner: 10Giuseppe Lavagetto) [14:48:09] <_joe_> ahah ok [14:48:15] <_joe_> that makes sense [14:49:31] (03PS1) 10Filippo Giunchedi: graphite: run archive-instance once a day [puppet] - 10https://gerrit.wikimedia.org/r/258976 (https://phabricator.wikimedia.org/T120377) [14:50:01] (03PS2) 10Filippo Giunchedi: graphite: run archive-instance once a day [puppet] - 10https://gerrit.wikimedia.org/r/258976 (https://phabricator.wikimedia.org/T120377) [14:50:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: run archive-instance once a day [puppet] - 10https://gerrit.wikimedia.org/r/258976 (https://phabricator.wikimedia.org/T120377) (owner: 10Filippo Giunchedi) [14:50:43] PROBLEM - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:53:44] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1877688 (10hashar) 3NEW [14:54:23] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:54:55] may someone please acknowledge the Icinga alarm "zuul_merger_service_running on scandium"? The server has lost its disk and I stopped the process ( https://phabricator.wikimedia.org/T121400 ) [14:56:13] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [14:58:42] 6operations, 10Analytics-Cluster: New LDAP IPs not reachable from analytics VLAN - https://phabricator.wikimedia.org/T121401#1877697 (10Ottomata) 3NEW a:3akosiaris [15:01:23] !log "service ganglia-monitor restart" on cp*, because most had no varnish stats flowing into ganglia [15:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:55] we have some dependency problem there I think. As in, it needs to have a hard service dep on the varnish instances themselves (stop before they stop, start after they start) [15:02:01] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1877725 (10hashar) scandium has: ``` lang=ruby mount { '/srv/ssd': ensure => mounted, device => '/dev/md2', fstype => 'xfs', options => 'noatime,nodiratime... [15:02:13] if varnish restarts while it's running, the stats just die [15:02:22] (or if it starts before varnish on boot) [15:02:56] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1877738 (10hashar) So I think we need to drop the `/srv` mount from `/etc/fstab`: ``` UUID=d588649c-4a40-4853-8d33-a82ed028fb1e /srv xfs defaults 0 2 ``` Then unmount both points and remount `/srv... [15:04:38] (03PS1) 10Giuseppe Lavagetto: mediawiki: add conftool-specific credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979 [15:06:22] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:06:30] may someone assist in dropping a mount on scandium please?
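(A sketch of the remount sequence T121400#1877738 describes -- assuming the puppet-managed /srv/ssd entry is the one to keep; the UUID is the one quoted above, and writers like zuul-merger are already stopped:)

    # The same device is mounted on both /srv and /srv/ssd; drop the
    # installer-created /srv entry and remount only the puppet-managed one.
    umount /srv/ssd
    umount /srv
    sed -i '/UUID=d588649c-4a40-4853-8d33-a82ed028fb1e/d' /etc/fstab
    mount /srv/ssd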
[15:06:38] (03CR) 10Daniel Kinzler: [C: 031] "Oh, I thought this was already set? We could already be tracking this in the property_info table..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude) [15:06:58] we have the same disk mounted on /srv/ and /srv/ssd which is causing some interesting effects depending on the mount order :-} [15:07:35] hashar: like corruption? that is THE side effect you should be looking for [15:07:52] unless it's ext4 which would probably make it funnier [15:07:55] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/258981 [15:08:06] akosiaris: xfs for some reason [15:08:24] ah, so corruption [15:08:35] it worked fine until for some reason the mount disappeared [15:08:52] and I have no idea what is writing there now. The paths are surely awkward now :-D [15:09:07] (03CR) 10jenkins-bot: [V: 04-1] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/258981 (owner: 10Giuseppe Lavagetto) [15:10:10] I think it is all about unmounting and removing in /etc/fstab the '/srv' entry that came from the server installation. described at https://phabricator.wikimedia.org/T121400#1877725 [15:12:23] PROBLEM - SSH on cygnus is CRITICAL: Server answer [15:14:10] (03CR) 10Aude: "@daniel it's not set anywhere yet but want to go ahead and do this :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258968 (https://phabricator.wikimedia.org/T121382) (owner: 10Aude) [15:15:03] 6operations, 7Monitoring, 5Patch-For-Review: Migrate reqstats icinga alerts to new graphite metrics and deprecate or adapt reqstats gdash - https://phabricator.wikimedia.org/T118979#1877757 (10Ottomata) 5Open>3Resolved [15:15:05] 6operations, 6Analytics-Backlog, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Turn off sqstat udp2log instance - https://phabricator.wikimedia.org/T117727#1877758 (10Ottomata) [15:15:16] (03PS1) 10Ottomata: Disable analytics1026 misc (sqstat) udp2log instance, remove sqstat and udp2log [puppet] - 10https://gerrit.wikimedia.org/r/258983 (https://phabricator.wikimedia.org/T117727) [15:16:39] (03CR) 10jenkins-bot: [V: 04-1] Disable analytics1026 misc (sqstat) udp2log instance, remove sqstat and udp2log [puppet] - 10https://gerrit.wikimedia.org/r/258983 (https://phabricator.wikimedia.org/T117727) (owner: 10Ottomata) [15:17:29] (03PS2) 10Ottomata: Disable analytics1026 misc (sqstat) udp2log instance, remove sqstat and udp2log [puppet] - 10https://gerrit.wikimedia.org/r/258983 (https://phabricator.wikimedia.org/T117727) [15:20:23] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:21:38] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1877760 (10GWicke) Effect of upgrading to 2.1.12 in codfw: {F3087201} [15:21:52] (03CR) 10Ottomata: [C: 032] Disable analytics1026 misc (sqstat) udp2log instance, remove sqstat and udp2log [puppet] - 10https://gerrit.wikimedia.org/r/258983 (https://phabricator.wikimedia.org/T117727) (owner: 10Ottomata) [15:24:33] (03PS5) 10BBlack: VCL: no special handling for CentralAutoLogin [puppet] - 10https://gerrit.wikimedia.org/r/258207 (https://phabricator.wikimedia.org/T96847) [15:26:37] (03CR) 10BBlack: [C: 032] VCL: no special handling for CentralAutoLogin [puppet] - 10https://gerrit.wikimedia.org/r/258207 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [15:28:23] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[15:28:53] PROBLEM - SSH on cygnus is CRITICAL: Server answer [15:29:49] what is cygnus?^ [15:30:06] <_joe_> a constellation including several very interesting X-ray sources [15:30:27] <_joe_> including the black hole cygnus X-1, the first source emitting X-rays ever observed [15:30:34] god help me I laughed [15:30:37] <_joe_> *galactic [15:31:03] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:31:29] <_joe_> I can describe most of the salient features of all the X-ray sources in that constellation, probably [15:31:33] <_joe_> :P [15:31:54] <_joe_> as for the codfw server: no idea [15:35:06] PROBLEM - SSH on cygnus is CRITICAL: Server answer [15:35:08] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1877793 (10JAllemandou) I don't know how it will affect our use of cassandra (heavy load once a day), but we can definitely try :) [15:37:06] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:41:18] (03PS2) 10Ottomata: Remove gadolinium related udp2log and varnishncsa stuff [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) [15:42:12] <_joe_> \o/ [15:42:24] mutante|1way: looking, cygnus is something you were working on but temp and I'm not sure of its current state; it's flapping madly so I'm silencing it [15:42:39] (03CR) 10jenkins-bot: [V: 04-1] Remove gadolinium related udp2log and varnishncsa stuff [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) (owner: 10Ottomata) [15:42:50] (03Abandoned) 10Thcipriani: RESTBase configuration for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [15:43:16] PROBLEM - SSH on cygnus is CRITICAL: Server answer [15:44:00] (03CR) 10Ottomata: [C: 032 V: 032] Remove gadolinium related udp2log and varnishncsa stuff [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) (owner: 10Ottomata) [15:45:17] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:45:22] (03PS4) 10BBlack: cache_text/mobile: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) [15:47:16] (03CR) 10jenkins-bot: [V: 04-1] cache_text/mobile: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [15:48:52] ottomata: you bypassed a jenkins -1 for puppetlint-strict, now it's going to -1 everyone else's changes :P [15:49:01] ? [15:49:08] OH [15:49:10] because i left it there [15:49:18] yeah, bblack, i did that because i'm about to remove that whole block of puppet code [15:49:34] guess i shouldn't have been lazy about it [15:49:36] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures [15:49:52] just needed puppet to run everywhere and then can remove that [15:49:55] it just doesn't like ensure not being the first attribute [15:49:57] yeah [15:49:58] ok
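(The rule being tripped above is puppet-lint's ensure_first_param check. A minimal sketch of reproducing or suppressing it locally -- the module path is illustrative, and the flag spellings assume stock puppet-lint:)

    # Show only "ensure is not the first attribute" warnings:
    puppet-lint --only-checks ensure_first_param modules/udp2log/
    # Or skip that check while the offending block is pending removal:
    puppet-lint --no-ensure_first_param-check modules/udp2log/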
[15:50:16] PROBLEM - Varnish traffic logger - multicast_relay on cp2014 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:50:18] bblack, btw, that is removing the last varnishncsa instance from all varnishes!!!! :) [15:50:21] Ohhhh pssshhhh [15:50:25] (03CR) 10BBlack: [C: 032 V: 032] "The jenkins -1 isn't from my change :P" [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) (owner: 10BBlack) [15:50:26] PROBLEM - Varnish traffic logger - multicast_relay on cp1054 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:50:26] PROBLEM - Varnish traffic logger - multicast_relay on cp2011 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:50:27] Ohhhh those ones [15:50:28] uh oh [15:50:33] pfffff [15:50:48] get ready for a deluge, is there a way I can schedule downtime for a service on all nodes at once? [15:51:17] PROBLEM - Varnish traffic logger - multicast_relay on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:51:22] looking at wikitech... [15:51:23] nope [15:51:26] PROBLEM - SSH on cygnus is CRITICAL: Server answer [15:51:35] but you could remove the monitor before removing the service in separate changes :) [15:51:46] PROBLEM - Varnish traffic logger - multicast_relay on cp1066 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:51:46] PROBLEM - Varnish traffic logger - multicast_relay on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:51:46] PROBLEM - Varnish traffic logger - multicast_relay on cp3042 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [15:51:55] PROBLEM - Varnish traffic logger - multicast_relay on cp3006 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:51:57] PROBLEM - Varnish traffic logger - multicast_relay on cp4019 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:52:02] hm, that will only do hosts [15:52:04] yeah [15:52:07] eef [15:52:18] maybe I can manually remove those blocks from icinga confs [15:52:18] (03PS2) 10Luke081515: Enable interface-editor group at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258453 (https://phabricator.wikimedia.org/T120348) [15:52:19] ... [15:52:27] PROBLEM - Varnish traffic logger - multicast_relay on cp3030 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [15:52:36] PROBLEM - Varnish traffic logger - multicast_relay on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:52:37] PROBLEM - Varnish traffic logger - multicast_relay on cp1099 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:52:38] don't worry there's only something like 119 cache hosts :P [15:52:42] hah [15:52:44] ja [15:52:56] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:53:17] PROBLEM - Varnish traffic logger - multicast_relay on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:53:26] PROBLEM - Varnish traffic logger - multicast_relay on cp2010 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:53:27] PROBLEM - Varnish traffic logger - multicast_relay on cp4015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:53:31] puppeting neon, I don't know if it requires new catalog compilations on all first though (probably) [15:53:38] it does [15:53:46] PROBLEM - Varnish traffic logger - multicast_relay on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:53:46] PROBLEM - Varnish traffic logger - multicast_relay on cp1073 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:53:52] i could remove storedconfigs for all of them [15:53:57] PROBLEM - Varnish traffic logger - multicast_relay on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:54:12] and then the next puppet run on each and then neon would bring the desired alerts back... [15:54:17] PROBLEM - Varnish traffic logger - multicast_relay on cp1065 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:54:31] no [15:54:35] yeah [15:54:37] not worth it [15:54:37] PROBLEM - Varnish traffic logger - multicast_relay on cp2019 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:54:40] (don't remove storedconfig, that will screw up everything else) [15:54:45] PROBLEM - Varnish traffic logger - multicast_relay on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:54:46] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [15:54:55] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:55:07] PROBLEM - Varnish traffic logger - multicast_relay on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:55:36] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:55:46] PROBLEM - Varnish traffic logger - multicast_relay on cp3031 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [15:55:57] PROBLEM - Varnish traffic logger - multicast_relay on cp3013 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:56:35] PROBLEM - Varnish traffic logger - multicast_relay on cp2015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:56:36] PROBLEM - Varnish traffic logger - multicast_relay on cp3045 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [15:56:46] PROBLEM - Varnish traffic logger - multicast_relay on cp3046 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [15:57:54] sorry for the noise everyone...
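For the record, the downtime question does have a scriptable answer: Icinga accepts SCHEDULE_SVC_DOWNTIME through its external command file, one line per host, so a single service can be silenced fleet-wide with a loop. A rough sketch, where the command-file path and the host list file are assumptions:

```
# Sketch: schedule 2h of downtime for one service on a list of hosts.
CMDFILE=/var/lib/icinga/rw/icinga.cmd          # assumed path on the icinga host
SVC='Varnish traffic logger - multicast_relay'
NOW=$(date +%s); DUR=7200; END=$((NOW + DUR))
while read -r host; do
  # [time] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger;duration;author;comment
  printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;1;0;%s;ops;removing varnishncsa\n' \
    "$NOW" "$host" "$SVC" "$NOW" "$END" "$DUR" > "$CMDFILE"
done < cache_hosts.txt                          # assumed file, one hostname per line
```

The other approach suggested above, removing the monitor in one change and the service in a follow-up, avoids the alert storm without any downtime bookkeeping.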
[15:59:16] PROBLEM - Varnish traffic logger - multicast_relay on cp4009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [15:59:27] PROBLEM - Varnish traffic logger - multicast_relay on cp2002 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151214T1600). [16:00:04] bearND, mdholloway revi MatmaRex aharoni dcausse: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:09] duh [16:00:18] PROBLEM - Varnish traffic logger - multicast_relay on cp3017 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:00:23] sup [16:00:24] yo [16:00:25] hi [16:00:52] I can SWAT, I'll go mostly in order, I'm going to get revi 's first, since it's a simple config change. [16:00:56] Nikerabbit: around? [16:00:59] yes [16:01:02] thanks [16:01:08] PROBLEM - Varnish traffic logger - multicast_relay on cp2001 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:01:18] PROBLEM - Varnish traffic logger - multicast_relay on cp3016 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:01:21] a real thanks since it's 1AM here :-p [16:01:31] Nikerabbit: As I mentioned earlier, I can test the JWT patch from high-level, but it's nice to have you around if anything deeper is needed. [16:01:37] PROBLEM - Varnish traffic logger - multicast_relay on cp1053 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:01:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [16:01:46] PROBLEM - Varnish traffic logger - multicast_relay on cp3007 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:01:54] aharoni: if MT works then it works [16:01:57] PROBLEM - Varnish traffic logger - multicast_relay on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:02:07] aharoni: I tested it locally so far that it produces a token [16:02:07] revi: glad that works out well for both of us :) [16:02:16] :-) [16:02:42] (03Merged) 10jenkins-bot: Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [16:03:45] PROBLEM - Varnish traffic logger - multicast_relay on cp3032 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:04:06] PROBLEM - Varnish traffic logger - multicast_relay on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:04:17] 6operations, 10Analytics-Cluster: New LDAP IPs not reachable from analytics VLAN - https://phabricator.wikimedia.org/T121401#1877845 (10akosiaris) 5Open>3Resolved ACLs updated. Dropped labcontrol2001 and neptunium and added seaborgium/serpens. Tested from stat1003 and it now works fine.
[16:04:25] PROBLEM - Varnish traffic logger - multicast_relay on cp3015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:04:37] PROBLEM - Varnish traffic logger - multicast_relay on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:04:55] PROBLEM - Varnish traffic logger - multicast_relay on cp2004 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:05:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: do not index User namespace on ko.wikipedia [[gerrit:258667]] (duration: 00m 29s) [16:05:18] ^ revi check please [16:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:32] confirmed [16:05:43] revi: thank you! [16:05:46] was bot index allowed before SWAT, now bot index not allowed [16:05:49] goodnight! [16:06:08] (damn grammar anyway) [16:06:08] revi: have a good night :) [16:06:17] PROBLEM - Varnish traffic logger - multicast_relay on cp4006 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:06:20] bearND: mdholloway you're up! [16:06:25] :) [16:06:38] thcipriani: howdy! [16:06:50] thcipriani: think we're all set this morning [16:06:55] PROBLEM - Varnish traffic logger - multicast_relay on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:07:06] thcipriani: (backport patch up, i mean) [16:07:16] mdholloway: cool, thanks. [16:08:05] PROBLEM - Varnish traffic logger - multicast_relay on cp2009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:08:06] PROBLEM - Varnish traffic logger - multicast_relay on cp2024 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:08:06] PROBLEM - Varnish traffic logger - multicast_relay on cp3049 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:08:26] PROBLEM - Varnish traffic logger - multicast_relay on cp1047 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:08:57] PROBLEM - Varnish traffic logger - multicast_relay on cp1052 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:09:05] PROBLEM - Varnish traffic logger - multicast_relay on cp4012 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:09:17] PROBLEM - Varnish traffic logger - multicast_relay on cp4020 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:11:05] PROBLEM - Varnish traffic logger - multicast_relay on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:11:22] PROBLEM - Varnish traffic logger - multicast_relay on cp3041 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:11:43] PROBLEM - Varnish traffic logger - multicast_relay on cp3043 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:12:02] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:12:03] PROBLEM - Varnish traffic logger - multicast_relay on cp3044 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:12:13] PROBLEM - Varnish traffic logger - multicast_relay on cp3033 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:13:13] PROBLEM - Varnish traffic logger - multicast_relay on cp1074 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:13:22] !log performing schema change on testwiki.page (s3) [16:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:56] Nikerabbit: Yeah, it's mostly for the case that translation doesn't work :) [16:14:23] PROBLEM - Varnish traffic logger - multicast_relay on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:14:48] !log thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% [[gerrit:258985]] (duration: 00m 30s) [16:14:51] ^ bearND mdholloway sync'd! [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:54] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [16:14:54] aharoni: then I will have a look [16:15:13] PROBLEM - Varnish traffic logger - multicast_relay on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:15:33] PROBLEM - Varnish traffic logger - multicast_relay on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:15:52] PROBLEM - SSH on cygnus is CRITICAL: Server answer [16:16:12] PROBLEM - Varnish traffic logger - multicast_relay on cp1072 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:16:32] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.32.153:10042/: Timeout on connection while downloading http://10.64.32.153:10042/ [16:16:42] PROBLEM - Varnish traffic logger - multicast_relay on cp2007 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:16:48] thcipriani: hmm, I don't see it yet, even when providing a different query parameter [16:17:03] bearND: lemme spot check it. [16:17:13] https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json?gakljoighoi [16:17:44] or https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json?dsfhsfiuhui [16:17:53] it still has the old value of 30. 
I expected 55 [16:18:14] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [16:18:24] PROBLEM - Varnish traffic logger - multicast_relay on cp2023 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:18:32] PROBLEM - Varnish traffic logger - multicast_relay on cp2016 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:18:44] PROBLEM - Varnish traffic logger - multicast_relay on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:19:03] PROBLEM - Varnish traffic logger - multicast_relay on cp3018 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [16:19:04] bearND: hmm, that's strange. the file is definitely up-to-date on mw1017. [16:19:34] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [16:19:44] PROBLEM - Varnish traffic logger - multicast_relay on cp3040 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:20:15] PROBLEM - Varnish traffic logger - multicast_relay on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-multicast_relay.pid, UID = 114 (varnishlog) [16:21:00] bearND: so, sending the x debug header with the request definitely makes it show; have you done the purge thing? [16:21:27] thcipriani: I've done the purge thing as well [16:22:35] thcipriani: what's the URL you use that makes it work? [16:23:08] bearND: it's just when I send the X-Wikimedia-Debug header with a request [16:23:27] There's an extension that will send the header for you on chrome ;) [16:23:51] thcipriani: do you set a value for the header? [16:23:57] just 1 [16:24:06] ok, i'll try curl then [16:25:00] curl -H'X-Wikimedia-Debug: 1' https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json [16:25:38] thcipriani: ok, that worked, but only with the header [16:25:41] lemme try touch and resync, because I definitely can't get to the file without the header. [16:26:23] caching? [16:26:32] https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json works without the header for me [16:27:01] !log thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 55% Part II [[gerrit:258985]] (duration: 00m 29s) [16:27:22] Probably need to purge it from varnish [16:27:33] PROBLEM - SSH on cygnus is CRITICAL: Server answer [16:28:03] Reedy: you're seeing the restbaseBetaPercent: 55 ? [16:28:15] no, 30 [16:28:19] it's varnish caching [16:28:26] static/current is unversioned, there shouldn't be any implicit reliance on fast updates to such contents :P [16:28:47] echo "https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json" | mwscript purgeList.php aawiki [16:28:55] Though, doesn't it appear at meta, rather than on www.wikimedia.org? [16:29:32] Reedy: I've run that several times already [16:29:33] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [16:29:53] purgeList is sinful :P [16:30:17] age on that url reports as 106s, so it has been purged [16:30:21] what's the correct new version, 30 or 55?
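The purge-and-verify cycle being used here condenses to a few commands, all of which appear verbatim in the discussion above; the only addition is checking the Age response header to confirm the purge took:

```
URL='https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json'

# Purge the URL from Varnish (run on the maintenance/deploy host):
echo "$URL" | mwscript purgeList.php aawiki

# A small Age means the cached object postdates the purge; a large one means it survived:
curl -sI "$URL" | grep -i '^age:'

# Bypass the cache entirely to see what MediaWiki itself serves:
curl -s -H 'X-Wikimedia-Debug: 1' "$URL" | grep restbaseBetaPercent
```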
[16:30:27] (03CR) 10Hashar: "scfc pointed on the task that daemonize yes has been removed explicitly 16794d5a56bae609d7a6fe85382cfe5e475063cb That is when using the " [puppet] - 10https://gerrit.wikimedia.org/r/258972 (https://phabricator.wikimedia.org/T121396) (owner: 10Hashar) [16:30:28] 55 [16:30:30] 55 is expected [16:30:43] yeah I get 30 even on a direct internal request to mediawiki from at least some servers [16:31:08] hmm, no error reports from scap [16:31:20] also, the cache-control/expires sent by mediawiki for that file are 1-year-long [16:31:26] You mention config.json and android.json above [16:31:54] Reedy: should be symlinks :\ [16:31:59] try from inside our network: [16:32:00] curl http://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json --resolve www.wikimedia.org:80:10.2.2.1 [16:32:05] maybe the purges are sent to the wrong varnishes? [16:32:21] well step 1 is getting mediawiki to output the right thing [16:32:28] I get 30 every time [16:33:06] yes, android.json is a sym link to config.json [16:34:59] curl -H 'host:www.wikimedia.org' localhost/static/current/extensions/MobileApp/config/config.json on mw1017 at least returns 55 [16:35:43] that's the test server though, not one of the prod ones [16:35:49] I'm sure Krenair said last time around about there being some meta url to purge that for [16:35:57] nope. [16:36:13] it's www.wikimedia.org [16:36:19] PROBLEM - salt-minion processes on technetium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:36:53] bblack@mw1258:~$ curl -sH 'host:www.wikimedia.org' localhost/static/current/extensions/MobileApp/config/config.json |grep restbase "restbaseBetaPercent": 30 [16:37:05] blerg. yeah, that curl on mw1244 returns 30; however, the file on disk shows 55 [16:37:47] PROBLEM - SSH on cygnus is CRITICAL: Server answer [16:37:59] regardless of that level of issue, though, there's something fundamentally misdesigned if we're expecting the file contents located at an unversioned URL-path with 1-year cache headers to update reliably [16:38:40] 6operations, 10DBA, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1877926 (10jcrespo) 5Open>3Resolved Th... [16:39:14] "When is static not so static..." [16:42:31] that's fair, although it's strange that we've been able to update it in the recent past without incident. [16:42:44] probably by using some cache-purging script [16:43:01] but still, ideally purging should not be a normal part of deployment, those are design issues we need to eventually address [16:43:36] This isn't a "normal deployment" in many regards [16:44:09] I'd assume since that looks like json configuration values for an application that we might change regularly, this falls into that kind of category in general [16:45:10] generally that kind of thing isn't time-critical, and we want caching, but we don't want forever-caching. Just giving it, say, 15 or 30 minute cache lifetime would be fine, and then waiting to see the effect gradually after a change. 
[16:45:24] (03PS1) 10Dereckson: Add davidabian.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259003 (https://phabricator.wikimedia.org/T121383) [16:45:37] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [16:45:39] (and planning changes such that it's not critical. e.g. if code relies on a new key being present, deploy the new config key first, then wait and deploy the code) [16:45:55] (03PS1) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [16:46:38] in the interest of getting other patches out, I'm going to push forward on SWAT. MatmaRex do you have a backport to .8 of your patch, or should I make one? [16:47:37] bblack: Better yet, code defensively so if the config /isn't/ live yet, it falls back to current behavior :) [16:47:44] well sure :) [16:47:44] (03CR) 10jenkins-bot: [V: 04-1] monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 (owner: 10Alexandros Kosiaris) [16:47:57] thcipriani: uh, no, i think i didn't make one yet [16:48:00] continuing on the cache-header theme though: with the current headers on that URL, even caches outside our control can keep it for a year [16:48:06] MatmaRex: kk, I'll make one. [16:48:11] bblack: I'd be ok with a 15 or 30 min. cache expiration. 1 year does seem a bit long for a config file. Not sure why the static folder was used for this. [16:48:52] Yes, the app handles missing keys just fine, and it falls back to previous behavior. [16:49:38] bblack: On the subject of 1y cache evictions, fancy a glance at https://gerrit.wikimedia.org/r/#/c/257676/? It's trivial. [16:51:13] (03PS1) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [16:51:37] PROBLEM - SSH on cygnus is CRITICAL: Server answer [16:52:11] (03PS2) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [16:52:29] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to comands - https://phabricator.wikimedia.org/T120831#1877978 (10ArielGlenn) Well that was painful. The SAuth class keeps a cache of keys per master/minion combination (in theory one can run multiple min... [16:52:35] (03CR) 10jenkins-bot: [V: 04-1] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [16:53:38] (03CR) 10jenkins-bot: [V: 04-1] monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 (owner: 10Alexandros Kosiaris) [16:54:36] MatmaRex: getting ready to sync this out for .8, everything look right to you: https://gerrit.wikimedia.org/r/#/c/259006/1 [16:55:47] (03PS2) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [16:55:49] also on the subject of 1y cache expiries: all of our varnishes cap TTLs to 30 days internally [16:55:59] Just FYI [16:56:05] but that won't stop a 3rd party [16:57:16] thcipriani, ostriches, bblack * Any chance to still deploy our SWAT patch today?
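The "code defensively" advice amounts to treating every key in a fetched config as optional, with a baked-in default on the client. A sketch of that consumer-side pattern, using jq purely as an illustration (the app itself is not a shell script, and the default value of 30 is an assumption):

```
# Sketch: read a rollout percentage from remote config, falling back to a
# shipped default when the fetch fails or the key is missing.
DEFAULT=30
CONFIG=$(curl -sf --max-time 5 \
  'https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json') \
  || CONFIG='{}'
PCT=$(printf '%s' "$CONFIG" | jq -r ".restbaseBetaPercent // $DEFAULT")
echo "using restbaseBetaPercent=$PCT"
```

With this shape, a stale or unreachable config file degrades to current behavior instead of breaking the client, which is why the long cache lifetime was survivable here.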
[16:57:17] (03CR) 10jenkins-bot: [V: 04-1] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [16:57:34] (03PS1) 10Dereckson: Throttle typo: ip → IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259010 [16:57:36] (03PS1) 10Dereckson: Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259011 (https://phabricator.wikimedia.org/T121321) [16:58:05] aharoni: I was planning on it. Trying to get MatmaRex 's feedback so I can sync out UploadWizard. [16:58:15] bblack: Well gerrit gets a full pass from varnish anyway, so this is mainly for third party and browser caches. [16:58:37] (03PS1) 10Alexandros Kosiaris: role::cache::logging: set absent in the correct place [puppet] - 10https://gerrit.wikimedia.org/r/259012 [16:59:25] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1878001 (10Dereckson) [16:59:38] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:00:04] (03CR) 10Alexandros Kosiaris: [C: 032] role::cache::logging: set absent in the correct place [puppet] - 10https://gerrit.wikimedia.org/r/259012 (owner: 10Alexandros Kosiaris) [17:00:18] RECOVERY - salt-minion processes on technetium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:00:22] thcipriani: Something go wrong with the patch? [17:01:05] (03PS3) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [17:01:14] marktraceur: no, just couldn't get anyone to verify I was about to sync the right thing (and probably that means to check after it was sync'd) [17:01:23] (03PS3) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [17:01:29] Oh, I thought MatmaRex was here [17:01:44] ottomata: FYI on fluorine, Error: /Stage[main]/Misc::Udp2log::Utilities/File[/usr/local/bin/sqstat]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/udp2log/sqstat.pl [17:02:06] HMmMmMm, i ensured absent, interesting..>. [17:02:09] looking [17:02:09] marktraceur: yeah, a little bit earlier, couldn't get a verification about my backport though, if you want to babysit/check, I have it ready to sync on tin. [17:02:23] Yeah I can do that. [17:02:27] kk, thanks. [17:02:46] sok, can just remove puppet resource for that [17:03:03] (03CR) 10jenkins-bot: [V: 04-1] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [17:05:48] !log thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/UploadWizard/resources/mw.UploadWizardDetails.js: SWAT: mw.UploadWizardDetails update [[gerrit:258767]] (duration: 00m 30s) [17:05:50] ^ marktraceur check please [17:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:38] On it, need to deal with the cache [17:06:45] kk [17:06:59] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [17:07:10] Looks like we're good, fixes the bug [17:07:18] Thanks thcipriani [17:07:28] marktraceur: thanks for checking [17:08:15] aharoni: kk, still around for your SWAT patch? [17:08:50] thcipriani: yes! [17:09:15] yes! 
[17:09:18] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [17:10:58] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:13:24] durr [17:14:28] something stuck? https://integration.wikimedia.org/zuul/ [17:14:58] and moving [17:15:03] Nikerabbit: yeah [17:15:19] Nikerabbit: https://integration.wikimedia.org/ci/job/mwext-testextension-zend/16897/console [17:15:21] Nikerabbit: still running [17:15:36] the progress bar are not really accurate [17:15:46] it was waiting for VE job to finish [17:16:04] if the lines make any sense to me [17:17:49] done [17:18:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:19:38] PROBLEM - SSH on cygnus is CRITICAL: Server answer [17:19:39] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:20:23] !log thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/ContentTranslation/api/ApiContentTranslationToken.php: SWAT: Fix check for JWT [[gerrit:258776]] (duration: 00m 29s) [17:20:26] ^ aharoni check please [17:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:31] checkin [17:20:34] api works for me [17:20:45] thcipriani: marktraceur: i'm really sorry, i had to leave at the worst possible time. i see you took care of it? [17:20:55] MatmaRex: Yeah, no problem [17:20:57] yay, works [17:21:02] MatmaRex: yup, marktraceur got it :) [17:21:07] aharoni: thanks for checking! [17:21:08] Nikerabbit, kart_, thcipriani - done, fixed, tested [17:21:58] dcausse: ebernhardson can I push this to evening SWAT since we're a ½ hour over-time? [17:22:33] thcipriani: ok for me, ebernhardson ? [17:22:39] thcipriani: yea i can work that [17:24:58] dcausse: ebernhardson thanks, appreciated! [17:24:59] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [17:25:38] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:26:27] (03PS1) 10Halfak: Adds myspell-et and myspell-uk to ORES base. [puppet] - 10https://gerrit.wikimedia.org/r/259021 [17:31:08] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:32:31] YuviPanda, have a minute to look at https://gerrit.wikimedia.org/r/#/c/259021/ for me? [17:33:14] halfak: have you verified those packages exist on debian jessie? [17:33:29] Yes. ores-misc is jessie, right? [17:33:37] yes [17:33:39] cool [17:33:41] That's the box where I very things :) [17:33:53] (03PS2) 10Yuvipanda: ores: Adds myspell-et and myspell-uk to ORES base. [puppet] - 10https://gerrit.wikimedia.org/r/259021 (owner: 10Halfak) [17:34:04] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Adds myspell-et and myspell-uk to ORES base. [puppet] - 10https://gerrit.wikimedia.org/r/259021 (owner: 10Halfak) [17:34:13] \o/ [17:34:36] akosiaris: I merged yours too [17:34:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[17:35:15] (03PS4) 10Alexandros Kosiaris: Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 [17:35:49] PROBLEM - SSH on cygnus is CRITICAL: Server answer [17:35:56] * halfak watches puppet install his packages [17:35:58] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:36:16] <_joe_> halfak: :) [17:36:34] o/ _joe_ [17:36:43] feels good to start self-serving with puppet :) [17:37:29] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:37:49] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:40:29] 10Ops-Access-Requests, 6operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1878156 (10Ottomata) This was approved at the ops meeting today. I will fill this request today. [17:41:04] YuviPanda: ah, thanks, sorry [17:43:50] (03CR) 10Jcrespo: "This will be applied tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/240043 (owner: 10Muehlenhoff) [17:45:57] PROBLEM - SSH on cygnus is CRITICAL: Server answer [17:47:58] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [17:53:33] 6operations, 6Analytics-Engineering, 10Wikimedia-Logstash: Convert Hadoop-Logstash logging to use Redis to address failures - https://phabricator.wikimedia.org/T85015#1878226 (10Milimetric) 5Open>3declined a:3Milimetric We don't think anyone's going to work on this at this point. We're filing some oth... [17:54:07] PROBLEM - SSH on cygnus is CRITICAL: Server answer [17:56:27] (03CR) 10Giuseppe Lavagetto: "I was thinking of going a different route: moving this info to an ENC-like yaml file. We should try to brainstorm on this and get it right" [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) (owner: 10Filippo Giunchedi) [17:58:28] <_joe_> thcipriani: I'm going to be a few minutes late to the meeting, you can start without me though [17:58:41] _joe_: kk, thanks for the heads up [17:58:49] (03CR) 10Reedy: [C: 031] Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [18:00:22] (03PS4) 10Alexandros Kosiaris: monitoring: Use nagios_common for contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/259004 [18:05:57] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.32.153:10042/: Timeout on connection while downloading http://10.64.32.153:10042/ [18:06:59] mobrovac: do you know anything about ^^ ? [18:07:34] it's been acting up all day like that, was the case with citoid [18:07:45] need to look into it [18:08:07] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [18:09:34] cpu usage is normal on sca100x [18:10:26] is there a high rate of worker deaths?
[18:11:30] nope, high rate of 400 reqs [18:12:01] the math extension is now hitting mathoid for checks [18:12:39] about 90 req/s currently [18:12:43] (03PS8) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [18:13:03] (03PS9) 10Dzahn: varnish: move varnish-test-geoip into module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [18:13:07] actually, that's the rate RB sees; mathoid would get a fraction of that [18:14:08] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [18:14:13] not if most of them are erroneous [18:14:35] yeah, good point [18:15:59] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [18:16:10] (03PS1) 10Jcrespo: Repool db1057 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259033 [18:16:38] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:16:45] 6operations, 6Analytics-Backlog, 6Security, 6Zero: Purge > 90 days stat1002:/a/squid/archive/zero - https://phabricator.wikimedia.org/T92343#1878361 (10Milimetric) [18:17:04] 6operations, 6Analytics-Backlog, 6Security, 6Zero: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#1878362 (10Milimetric) [18:17:10] 6operations, 6Analytics-Backlog, 6Security: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#1878363 (10Milimetric) [18:17:16] 6operations, 6Analytics-Backlog, 6Security, 6Zero, 7Mobile: Purge > 90 days stat1002:/a/squid/archive/mobile - https://phabricator.wikimedia.org/T92341#1878364 (10Milimetric) [18:17:22] 6operations, 6Analytics-Backlog, 6Security: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#1878365 (10Milimetric) [18:17:44] 6operations, 6Analytics-Kanban, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1878366 (10Milimetric) [18:17:49] 6operations, 6Analytics-Backlog, 10Deployment-Systems, 6Services, 3Scap3: Deploy AQS with scap3 - https://phabricator.wikimedia.org/T114999#1878367 (10Milimetric) [18:18:17] (03PS1) 10Jcrespo: Depool db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259034 [18:19:39] (03CR) 10Jcrespo: [C: 032] Repool db1057 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259033 (owner: 10Jcrespo) [18:19:55] (03CR) 10Jcrespo: [C: 032] Depool db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259034 (owner: 10Jcrespo) [18:21:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1057 after maintenance; depool db1018 for maintenance (duration: 00m 29s) [18:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:35] 6operations, 6Analytics-Backlog, 6Security, 6Zero: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#1878382 (10Milimetric) When doing this, please ping @milimetric because he (me) has to remove the tables created on top of a copy of this data in HDFS [18:25:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [18:26:38] PROBLEM - SSH on cygnus is CRITICAL: Server answer [18:30:59] akosiaris: see that cygnus SSH issue there? 
it's odd, that's just a new ganeti VM i made that doesn't do anything [18:32:37] (03CR) 10Dzahn: [C: 032] varnish: move varnish-test-geoip into module [puppet] - 10https://gerrit.wikimedia.org/r/253457 (owner: 10Dzahn) [18:35:47] (03PS2) 10Dzahn: contint: rename slave-scripts class [puppet] - 10https://gerrit.wikimedia.org/r/258057 [18:35:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [18:36:20] (03PS1) 10Jcrespo: Reconfiguring db1018 (performance_schema, ferm, SSL, upgrade) [puppet] - 10https://gerrit.wikimedia.org/r/259038 [18:36:48] (03PS1) 10Jforrester: VisualEditor: Enable single edit tab mode on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259039 (https://phabricator.wikimedia.org/T121421) [18:37:18] (03CR) 10Dzahn: [C: 032] contint: rename slave-scripts class [puppet] - 10https://gerrit.wikimedia.org/r/258057 (owner: 10Dzahn) [18:37:30] !log restarting, upgrading and reconfiguring mysql on db1018 [18:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:11] (03PS2) 10Jcrespo: Reconfiguring db1018 (performance_schema, ferm, SSL, upgrade) [puppet] - 10https://gerrit.wikimedia.org/r/259038 [18:38:48] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [18:38:48] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [18:38:50] (03CR) 10Jforrester: [C: 04-1] "No rush." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 (owner: 10Jforrester) [18:39:27] (03CR) 10Jcrespo: [C: 032] Reconfiguring db1018 (performance_schema, ferm, SSL, upgrade) [puppet] - 10https://gerrit.wikimedia.org/r/259038 (owner: 10Jcrespo) [18:39:29] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [18:39:54] i ran puppet on cp1044 and don't see an issue [18:39:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [18:40:48] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:40:48] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:57] 7Puppet, 6Analytics-Backlog, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1878462 (10Milimetric) p:5Low>3High [18:42:38] (03CR) 10Jforrester: "This should be scheduled…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237686 (owner: 10Legoktm) [18:43:37] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:44:36] (03CR) 10Jforrester: "This should be scheduled…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie) [18:46:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [18:46:38] (03PS1) 10Jforrester: VisualEditor: Centralise feedback from test2wiki to MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259041 (https://phabricator.wikimedia.org/T92661) [18:47:07] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:50:55] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1878538
(10Dzahn) Alright, so @hoo you think we can close it as invalid then? [18:56:07] <_joe_> ah, when you press the hang-up button instead of the microphone in a hangout and you look like the asshole [18:57:08] PROBLEM - SSH on cygnus is CRITICAL: Server answer [18:58:58] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [18:59:58] (03PS1) 10Dzahn: ganglia: fix maps cluster naming [puppet] - 10https://gerrit.wikimedia.org/r/259043 (https://phabricator.wikimedia.org/T116234) [19:00:08] _joe_: I didn't know you had to press buttons to look like an asshole [19:00:13] * YuviPanda runs away very quickly [19:00:14] _joe_: hahaha [19:00:18] (03CR) 10jenkins-bot: [V: 04-1] ganglia: fix maps cluster naming [puppet] - 10https://gerrit.wikimedia.org/r/259043 (https://phabricator.wikimedia.org/T116234) (owner: 10Dzahn) [19:00:59] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [19:01:08] RECOVERY - Disk space on cygnus is OK: DISK OK [19:01:19] RECOVERY - NTP on cygnus is OK: NTP OK: Offset -0.005061268806 secs [19:01:19] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [19:01:28] RECOVERY - Check size of conntrack table on cygnus is OK: OK: nf_conntrack is 0 % full [19:02:38] RECOVERY - puppet last run on cygnus is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:02:45] !log reboot cygnus [19:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:47] (03PS2) 10Dzahn: ganglia: fix maps cluster naming [puppet] - 10https://gerrit.wikimedia.org/r/259043 (https://phabricator.wikimedia.org/T116234) [19:04:16] RECOVERY - DPKG on cygnus is OK: All packages OK [19:04:27] RECOVERY - RAID on cygnus is OK: OK: no RAID installed [19:04:33] (03CR) 10Dzahn: [C: 032] ganglia: fix maps cluster naming [puppet] - 10https://gerrit.wikimedia.org/r/259043 (https://phabricator.wikimedia.org/T116234) (owner: 10Dzahn) [19:04:46] RECOVERY - configured eth on cygnus is OK: OK - interfaces up [19:04:57] RECOVERY - dhclient process on cygnus is OK: PROCS OK: 0 processes with command name dhclient [19:05:17] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:08:39] 6operations, 6Discovery, 10Maps, 5Patch-For-Review: Capitalise "maps Cluster codfw" ganglia group - https://phabricator.wikimedia.org/T116234#1878569 (10Dzahn) It was "maps Cluster" vs. "Maps Caches". The patch above should have fixed it now.
[19:09:07] 6operations, 6Discovery, 10Maps, 5Patch-For-Review: Capitalise "maps Cluster codfw" ganglia group - https://phabricator.wikimedia.org/T116234#1878570 (10Dzahn) a:3Dzahn [19:09:53] 6operations: Investigate idle/depooled eqiad api appserver - https://phabricator.wikimedia.org/T116254#1878579 (10Dzahn) [19:09:54] 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1878580 (10Dzahn) [19:11:05] 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1744570 (10Dzahn) [19:16:59] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1878668 (10Dzahn) Since this changed from a VM request to real hardware i'm adjusting the relevant tags acco... [19:17:11] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1878669 (10Dzahn) [19:19:03] 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1878674 (10Dzahn) >>! In T116552#1752138, @yuvipanda wrote: > I already see `--no-80chars-check` in .puppet-lint.rc in operations/pu... [19:19:41] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to support piwik analytics - https://phabricator.wikimedia.org/T116312#1878677 (10Nuria) An update regarding hardware provisioning here will be great [19:20:07] !log rolling restart s2 codfw mysql servers [19:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:48] 6operations, 6Discovery, 10Maps, 5Patch-For-Review: Capitalise "maps Cluster codfw" ganglia group - https://phabricator.wikimedia.org/T116234#1878682 (10Dzahn) 5Open>3Resolved "Maps" is now after "Maps" , normal sorting in web ui. (the old one should also disappear) [19:23:20] 6operations: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1878696 (10Dzahn) meanwhile it's a "role spare" but still in site.pp. is there anything that needs to be copied from here, @Robh? [19:25:02] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1878714 (10Dzahn) what about site.pp and 2144 # LDAP servers relied on by OIT for mail 2145 node /(dubnium|pollux)\.wikimedia\.org/ { 2146 $cluster = 'openldap_corp_mirror' ? [19:26:52] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1878723 (10Dzahn) @Papaul ssh db2034.mgmt.eqiad.wmnet channel 0: open failed: administratively prohibited: open failed stdio forwarding failed ssh_exchange_identific...
[19:29:08] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: puppet fail [19:30:06] (03PS1) 10RobH: new ganglia.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/259049 [19:31:41] (03PS1) 10Jcrespo: Reconfigure databases in s2 shard (p_s, ssl, ROW) [puppet] - 10https://gerrit.wikimedia.org/r/259050 [19:32:37] (03PS1) 10Ottomata: Clean up some unused udp2log stuff [puppet] - 10https://gerrit.wikimedia.org/r/259051 (https://phabricator.wikimedia.org/T97294) [19:33:02] (03CR) 10Jcrespo: [C: 032] Reconfigure databases in s2 shard (p_s, ssl, ROW) [puppet] - 10https://gerrit.wikimedia.org/r/259050 (owner: 10Jcrespo) [19:33:36] (03PS1) 10RobH: lists.wikimedia.org certificate update [puppet] - 10https://gerrit.wikimedia.org/r/259053 (https://phabricator.wikimedia.org/T120237) [19:37:11] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1878755 (10Ottomata) Hm, maybe we could host this on stat1001? It already hosts a couple of public an... [19:37:25] (03PS1) 10RobH: new wikitech.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259055 [19:37:34] 6operations, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1878758 (10Dzahn) What is the command line we would be checking? How do you identify in site.pp which servers are running this? The string "statsv" does not seem to appear there. [19:37:52] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1878767 (10Ottomata) stat1001 is still Precise though. We might want to upgrade it to Jessie before d... [19:40:22] (03PS1) 10RobH: new librenms.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259057 [19:40:39] (03PS2) 10Ottomata: Clean up some unused udp2log stuff [puppet] - 10https://gerrit.wikimedia.org/r/259051 (https://phabricator.wikimedia.org/T97294) [19:40:53] (03CR) 10Ottomata: [C: 032 V: 032] Clean up some unused udp2log stuff [puppet] - 10https://gerrit.wikimedia.org/r/259051 (https://phabricator.wikimedia.org/T97294) (owner: 10Ottomata) [19:42:11] (03PS1) 10RobH: new icinga.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259058 [19:44:01] (03PS1) 10Ottomata: Add madhuvishy to eventlogging-roots group (sudo on eventlog* hosts) [puppet] - 10https://gerrit.wikimedia.org/r/259059 (https://phabricator.wikimedia.org/T120731) [19:45:03] (03CR) 10BBlack: [C: 031] Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 (owner: 10Chad) [19:45:37] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:47:25] (03CR) 10Ottomata: [C: 032] Add madhuvishy to eventlogging-roots group (sudo on eventlog* hosts) [puppet] - 10https://gerrit.wikimedia.org/r/259059 (https://phabricator.wikimedia.org/T120731) (owner: 10Ottomata) [19:47:30] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1878827 (10Dzahn) I also think it's fixed now. I also actually tried it and got: ``` Account creation error Passwords must be at least 10 characters. 
``` [19:47:42] 7Puppet, 3Scap3: Move scap.cfg things out of scap and into puppet - https://phabricator.wikimedia.org/T121435#1878828 (10demon) 3NEW [19:47:45] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1878835 (10Dzahn) a:3Muehlenhoff [19:48:03] 6operations, 7Tracking: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1878837 (10Dzahn) [19:48:04] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1878836 (10Dzahn) 5Open>3Resolved [19:48:36] 6operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1878840 (10Ottomata) I'm not sure who to poke about this, so I'm adding it to Scrum-of-Scrums board. Please find a taker in that meeting. [19:49:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1878846 (10Ottomata) 5Open>3Resolved a:3Ottomata [19:50:33] 6operations, 7Monitoring, 7Privacy, 7Security-Core: status.wikimedia.org should not load Google Analytics - https://phabricator.wikimedia.org/T115945#1878854 (10Ottomata) 5Open>3Invalid a:3Ottomata Marking as Invalid. This is not a WMF controlled site. [19:50:49] 7Puppet, 6operations: 'role' function doesn't find classes in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#1878860 (10Dzahn) Afaik the classes should instead be: module/role/puppetmaster/something.pp and we are fixing this by moving all things from manifests/role into module/role. [19:52:36] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1878865 (10Ottomata) a:3Ottomata [19:53:18] hashar: yt? [19:53:54] (03PS1) 10Chad: More fixmes for scap/manifests/scripts.pp [puppet] - 10https://gerrit.wikimedia.org/r/259060 [19:55:27] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:59:43] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1878882 (10Ottomata) Ok, I did this. `/srv/ssd` is now mounted, but `/srv` is not. However, due to some previous job run, it looks like zuul was cloned at `/srv/ssd/zuul` when `/srv` was mount... [20:00:08] !log changing master of db2017 to be now db1018, instead of s2-master [20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:40] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1878910 (10Ottomata) p:5Triage>3Normal [20:05:32] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1878923 (10Ottomata) p:5Triage>3Low a:3Dzahn [20:05:56] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1869230 (10Ottomata) @Dzahn, since you are in touch about this, I am assigning to you. Feel free to change status as you see fit.
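Circling back to the statsv monitoring question raised earlier (what command line would be checked): the existing process alerts in this log are check_procs-based, and a statsv check would plausibly take the same shape. A sketch only; the plugin path and argument regex are assumptions, not the check that was actually deployed:

```
# Sketch: NRPE-style process-presence check for statsv.
/usr/lib/nagios/plugins/check_procs -c 1: --ereg-argument-array='statsv'
```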
[20:06:33] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1878928 (10Ottomata) p:5Triage>3Normal a:3RobH [20:09:50] 6operations: kafkatee lagging on oxygen - https://phabricator.wikimedia.org/T99716#1878956 (10Ottomata) 5Open>3Invalid This hasn't happened in a while. Marking as invalid. [20:11:58] ottomata: hello, doing scandium aren't you? :)} [20:12:01] (03PS1) 10Chad: scap: Put configuration for scap in /etc/scap3.cfg [puppet] - 10https://gerrit.wikimedia.org/r/259071 (https://phabricator.wikimedia.org/T121435) [20:12:22] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1878972 (10Ottomata) 5Open>3Resolved [20:13:44] 6operations: Can protactinium be reclaimed (was emergency gadolinium replacement) - https://phabricator.wikimedia.org/T89009#1878976 (10Ottomata) 5Open>3Resolved protactinium is marked as a spare. [20:15:09] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1878983 (10hashar) Thank you! I restarted the zuul-merger instance since /srv/ssd/zuul/git is now fine. `/srv/ssd/ssd` can be nuked entirely. I lack root access to do so. [20:15:34] ottomata: on scandium you can indeed rm all of /srv/ssd/ssd [20:15:42] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1878984 (10Ottomata) Anyone mind if I decline this task? [20:15:43] ok danke [20:15:49] !log scandium restarted zuul-merger [20:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:12] I totally overlooked that duplicate / nested mount issue :( [20:16:16] done [20:16:26] RECOVERY - zuul_merger_service_running on scandium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [20:16:38] 6operations, 10Continuous-Integration-Infrastructure: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1878985 (10Ottomata) 5Open>3Resolved [20:16:51] 6operations, 10Continuous-Integration-Infrastructure, 7WorkType-Maintenance: scandium lost /srv - https://phabricator.wikimedia.org/T121400#1878986 (10hashar) [20:17:01] ottomata: thanks ! [20:19:01] (03PS1) 10Dzahn: labtest: don't send SMS for test machines [puppet] - 10https://gerrit.wikimedia.org/r/259073 (https://phabricator.wikimedia.org/T120047) [20:19:36] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [20:20:27] deploying graphoid update .... cc greg-g mobrovac gwicke [20:21:04] yurik: don't forget to !log it as well [20:21:14] mobrovac, sure, after i'm done [20:25:40] !log updated graphoid service [20:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:47] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1879029 (10RobH) a:5RobH>3akosiaris So the requirements for these is FAR lower than the new dual CPU 4 * 3TB systems we just ordered for both sites. I can allocate one of the... [20:36:30] <_joe_> YuviPanda: well, this time I wasn't tryng to be an asshole [20:39:37] _joe_: hi, are you familiar with https://github.com/StackExchange/blackbox ? 
[20:40:01] <_joe_> matanya: nope [20:40:15] I think it can be useful for ops [20:40:28] mainly in the context of the private repo [20:40:45] <_joe_> matanya: we already use something like that for some sensitive repos [20:40:57] I think akosiaris was looking for a solution for this use case [20:41:10] but if you already have one, then great! :) [20:41:26] what is that solution by the way? is it pass ? [20:45:00] i want to depool mw1259, i checked pybal-config and it's nowhere. is that because i should check in etcd now ? [20:45:10] or because it was never pooled [20:45:27] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:47:11] (03PS1) 10Ottomata: Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) [20:47:31] (03CR) 10jenkins-bot: [V: 04-1] Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [20:47:51] (03PS2) 10Ottomata: Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) [20:48:11] (03CR) 10jenkins-bot: [V: 04-1] Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [20:49:44] (03PS3) 10Ottomata: Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) [20:49:51] !log puppet on neon disabled for certificate update [20:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:41] (03CR) 10Ottomata: [C: 032] Modifying kafka replica alerting to be more lenient [puppet/kafka] - 10https://gerrit.wikimedia.org/r/259079 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [20:52:08] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule with alerting change [puppet] - 10https://gerrit.wikimedia.org/r/259083 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [20:53:07] (03PS2) 10RobH: new icinga.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259058 [20:54:20] !log rebooting mw1259 for BIOS setting [20:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:31] (03CR) 10RobH: [C: 032] new icinga.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/259058 (owner: 10RobH) [20:56:45] ok, lets see if i blow up icinga... [20:57:08] PROBLEM - Host mw1259 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:10] !log mw1259 - enabled hyperthreading (logical processor) [20:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:11] too bad neon already takes forever for a puppet run [20:58:44] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1879090 (10Ottomata) a:3akosiaris [20:59:18] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1849958 (10Ottomata) @akosiaris can you help with this?
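On the blackbox question above: the mechanism behind tools like blackbox (and pass) is GPG-encrypting each secret file to every admin's public key, so only ciphertext lives in the repository. A rough sketch of that idea with plain gpg; the key IDs and file paths are placeholders:

```
# Sketch: blackbox-style secret handling with plain gpg.
gpg --encrypt --armor -r alice@example.org -r bob@example.org \
    --output private/secret.txt.asc private/secret.txt
shred -u private/secret.txt          # keep only the ciphertext under version control
git add private/secret.txt.asc

# Any listed recipient can recover the plaintext later:
gpg --decrypt private/secret.txt.asc > secret.txt
```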
[21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151214T2100). Please do the needful.
[21:00:07] RECOVERY - Host mw1259 is UP: PING OK - Packet loss = 0%, RTA = 2.92 ms
[21:00:20] 6operations, 10ops-eqiad: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1879102 (10Dzahn) booted into BIOS and enabled HyperThreading (Logical Processors). it was off. now we have 16 CPUs: [ 2.119577] microcode: CPU0 sig=0x206c2, pf=0x1, revision=0x13 [ 2.12540...
[21:00:23] 6operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1879105 (10ssastry) I'm going to poke @cscott about it tomorrow at the parsing team meeting. He will also likely be showing up for the Scrum of...
[21:00:34] no mobileapps deploy /cc:mdholloway
[21:01:00] 6operations, 10ops-eqiad: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1879110 (10Dzahn) 5Open>3Resolved a:3Dzahn
[21:01:47] !log neon returned to normal puppet updates. icinga new cert is live.
[21:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:02:41] 6operations, 10Deployment-Systems: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1879129 (10Dzahn) a:3Dzahn
[21:03:03] deploying parsoid now
[21:03:30] 6operations, 10Deployment-Systems: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#1856865 (10Dzahn) Let's start by agreeing on a specific (lower) UID for this and then update https://wikitech.wikimedia.org/wiki/UID So what number do we pick?
[21:04:00] !log starting parsoid deploy
[21:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:11:51] !log restarted parsoid on wtp1004 (~4 mins back) as a canary
[21:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:12:04] all looking good. restarting parsoid on all nodes.
[21:13:16] 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1879162 (10Papaul) @Dzahn I already did a hard reset but it didn't work, same problem
[21:15:56] !log finished deploy of parsoid sha df3171e6
[21:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:16:24] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1879176 (10csteipp) I've had this conversation with ops a few times, but I'll document it here for reference. I think we do need to find a way to segment our ne...
[21:24:42] 6operations, 10Traffic, 7HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1879207 (10BBlack) @Chmarkine - I'm not sure if it's wise or reasonable to remove old roots from PKP lists (except in strange policy cases like the recent one here: https://goo...
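The parsoid rollout above follows the usual canary pattern: restart one node, watch it for a few minutes, then roll through the rest. A minimal sketch of that pattern; the host-list file and service name are assumptions:

    canary=wtp1004.eqiad.wmnet
    ssh "$canary" sudo service parsoid restart
    sleep 240  # ~4 minutes of watching logs/metrics on the canary, as in the log
    grep -v "$canary" parsoid-hosts.txt | while read -r host; do
        ssh "$host" sudo service parsoid restart
    done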
[21:31:26] (03PS1) 10Rush: Revert "Reorder modules in common-account" [puppet] - 10https://gerrit.wikimedia.org/r/259148
[21:32:10] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "Reorder modules in common-account" [puppet] - 10https://gerrit.wikimedia.org/r/259148 (owner: 10Rush)
[21:32:24] (03CR) 10Rush: [C: 031] "thanks daniel" [puppet] - 10https://gerrit.wikimedia.org/r/259073 (https://phabricator.wikimedia.org/T120047) (owner: 10Dzahn)
[21:33:21] (03PS1) 10Ori.livneh: Remove test pages for `qlow`; add test page for I1df544f6f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259149 (https://phabricator.wikimedia.org/T115600)
[21:33:42] (03CR) 10Ori.livneh: [C: 032] Remove test pages for `qlow`; add test page for I1df544f6f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259149 (https://phabricator.wikimedia.org/T115600) (owner: 10Ori.livneh)
[21:34:08] (03Merged) 10jenkins-bot: Remove test pages for `qlow`; add test page for I1df544f6f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259149 (https://phabricator.wikimedia.org/T115600) (owner: 10Ori.livneh)
[21:34:56] !log ori@tin Synchronized docroot and w: (no message) (duration: 00m 29s)
[21:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:38:44] (03CR) 10Aaron Schulz: "Given that the lock manager servers used will be the eqiad one and MW will need to know about the other DC swift in even eqiad (for optimi" (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto)
[21:40:15] (03PS1) 10Ori.livneh: Fix-up for I4bad81e75 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259151
[21:40:28] (03CR) 10Ori.livneh: [C: 032] Fix-up for I4bad81e75 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259151 (owner: 10Ori.livneh)
[21:40:49] (03Merged) 10jenkins-bot: Fix-up for I4bad81e75 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259151 (owner: 10Ori.livneh)
[21:43:52] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[21:44:13] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused
[21:44:43] this is me testing ^^
[21:47:04] thcipriani: did you make further changes to get the android.json file from the MobileApp extension to update? Anyways, just wanted to let you know that it now shows the correct value. I've purged the cache for https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json
[21:48:31] bearND: I did not make additional changes. Glad it is working now.
[21:49:06] ok, was just wondering
[21:50:03] bearND: I can't explain why, though. Same process, different results. Maybe we were just lucky the first two times :\
[22:02:46] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1879266 (10Papaul) WMF6377 10.193.1.23 port g-5/0/9 WMF6378 10.193.1.24 port g-5/0/22 WMF6379 10.193.1.25 port g-5/0/33 WMF6380 10.193.1.26 port g-5/0/3...
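On bearND's purge of the android.json URL: a common way to evict a single URL from a Varnish cache is an HTTP PURGE request, when the VCL permits it. The log doesn't say how the purge was done here, so this is only an illustration of the technique:

    curl -X PURGE 'https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json'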
[22:03:12] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1879269 (10Papaul)
[22:03:21] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1879270 (10Dzahn) 5Open>3Invalid
[22:05:32] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused
[22:05:42] PROBLEM - Restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[22:07:25] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1879273 (10Papaul) WMF6377 10.193.1.23 port ge-5/0/9 WMF6378 10.193.1.24 port ge-5/0/22 WMF6379 10.193.1.25 port ge-5/0/33 WMF6380 10.193.1.26 port ge-5/0/34 WMF6381 10.193.1.27 port ge-5/0/1 WMF6...
[22:07:26] still testing ^^
[22:08:21] (03PS1) 10Dzahn: openldap/labs: make serpens/seaborgium backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/259155 (https://phabricator.wikimedia.org/T120919)
[22:08:33] ACKNOWLEDGEMENT - Restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) gwicke Testing on staging cluster.
[22:08:33] ACKNOWLEDGEMENT - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) gwicke Testing on staging cluster.
[22:08:45] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1879278 (10Dzahn) @Hoo @Lydia_Pintscher if you disagree please reopen the ticket again.
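The ACKNOWLEDGEMENT lines above come from acking the known-noisy checks while the staging cluster is being tested. For reference, an ack can also be submitted from the shell through icinga's external command file; the command format is icinga's documented ACKNOWLEDGE_SVC_PROBLEM, while the file path is the usual Debian default and an assumption here:

    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;cerium;Restbase endpoints health;1;1;1;gwicke;Testing on staging cluster.\n' \
        "$(date +%s)" > /var/lib/icinga/rw/icinga.cmd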
thanks
[22:09:45] !log mathoid deploying 5b20fe1
[22:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:09:55] (03PS2) 10Dzahn: openldap/labs: make serpens/seaborgium backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/259155 (https://phabricator.wikimedia.org/T120919)
[22:10:03] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1879279 (10Papaul)
[22:10:24] (03CR) 10Dzahn: [C: 032] "installing bacula client" [puppet] - 10https://gerrit.wikimedia.org/r/259155 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn)
[22:10:41] 6operations, 10ops-codfw, 5Patch-For-Review: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1879284 (10Papaul) a:5Papaul>3RobH
[22:10:47] !log puppet disabled on uranium for ganglia cert update
[22:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:11:15] (03PS2) 10RobH: new ganglia.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/259049
[22:12:29] (03CR) 10RobH: [C: 032] new ganglia.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/259049 (owner: 10RobH)
[22:13:44] (03PS1) 10Eevans: enable EventBus logging channel (currently only in beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259156 (https://phabricator.wikimedia.org/T116786)
[22:16:02] !log puppet enabled on uranium and resumed normal service, cert updated for ganglia
[22:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:19:14] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail
[22:35:47] (03CR) 10Ori.livneh: [C: 032 V: 032] import debian directory [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257168 (owner: 10Ori.livneh)
[22:36:01] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[22:36:13] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15184 bytes in 0.006 second response time
[22:37:45] (03PS1) 10Mobrovac: Logging: log entries pertaining to the Math extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259168 (https://phabricator.wikimedia.org/T121445)
[22:39:41] (03CR) 10GWicke: [C: 032] Logging: log entries pertaining to the Math extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259168 (https://phabricator.wikimedia.org/T121445) (owner: 10Mobrovac)
[22:40:06] (03Merged) 10jenkins-bot: Logging: log entries pertaining to the Math extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259168 (https://phabricator.wikimedia.org/T121445) (owner: 10Mobrovac)
[22:40:41] (03PS1) 10Yuvipanda: Revert "mediawiki_singlenode: rename defined type" [puppet] - 10https://gerrit.wikimedia.org/r/259169
[22:40:42] (03PS3) 10Dzahn: openldap/labs: make serpens/seaborgium backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/259155 (https://phabricator.wikimedia.org/T120919)
[22:40:51] chasemp: valhallasw`cloud ^^
[22:41:00] aaarg
[22:41:01] actually no
[22:41:02] that won't work
[22:41:09] since there's been a lot of changes to it since
[22:41:31] YuviPanda: why do you need it to be in the puppet repo?
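The openldap backup change being iterated on above pairs a slapcat cron with bacula: slapcat dumps the directory to LDIF files on disk, and bacula picks those files up. A minimal sketch of such a dump; the suffix and target path are assumptions:

    # Dump the whole LDAP database to a dated LDIF file for backup.
    slapcat -b 'dc=wikimedia,dc=org' -l "/var/backups/openldap/$(date +%F).ldif"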
[22:41:52] I don't, so I'm just going to edit LDAP to remove them
[22:42:03] valhallasw`cloud: I went looking in the repo to find out what the role was called
[22:42:07] ah
[22:42:10] and as I typed that I realize that's in the error message
[22:42:12] * YuviPanda facepalms
[22:42:15] ok
[22:43:02] (03PS1) 10Mobrovac: Math ext Logging: fix ordering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259170
[22:43:29] valhallasw`cloud: hmm
[22:43:32] valhallasw`cloud: there's only
[22:43:37] eeRievei6emeibe
[22:43:39] err
[22:43:43] eeRievei6emeibe
[22:43:44] err
[22:43:47] fuck
[22:43:50] stupid copy paste
[22:43:54] puppetClass: role::deprecated::mediawiki::install
[22:43:57] that
[22:43:58] I only saw ????????????? ;-D
[22:44:01] heh
[22:44:24] it's the password for my local vagrant setup
[22:44:26] so :P
[22:45:22] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[22:45:23] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15184 bytes in 0.015 second response time
[22:45:33] right, so remove that?
[22:45:41] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[22:45:47] (03CR) 10BryanDavis: [C: 031] Math ext Logging: fix ordering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259170 (owner: 10Mobrovac)
[22:46:26] yeah done
[22:46:30] (03CR) 10GWicke: [C: 032] Math ext Logging: fix ordering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259170 (owner: 10Mobrovac)
[22:46:53] (03Merged) 10jenkins-bot: Math ext Logging: fix ordering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259170 (owner: 10Mobrovac)
[22:48:25] social-tools1.social-tools.eqiad.wmflabs:
[22:48:26] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879425 (10Tgr) Is there an alternative path towards making logging non-tiny amounts of data...
[22:48:27] ----------
[22:48:29] what the fuck, salt
[22:49:42] !log restbase: canary deploy of 9f31847ad to restbase1001
[22:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:49:59] gwicke / mobrovac: are you going to sync those config changes?
[22:50:13] bd808: doing now
[22:50:23] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879426 (10ori) I don't think it's a problem; excessively large event payloads are a misuse...
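"Edit LDAP to remove them" above means deleting the deprecated puppetClass values from the instance's LDAP entry. A sketch with ldapmodify, using standard LDIF change syntax; the DN, server, and bind DN are assumptions for illustration:

    ldapmodify -x -H ldap://ldap.example.org -D 'cn=admin,dc=example,dc=org' -W <<'EOF'
    dn: dc=social-tools1,ou=hosts,dc=example,dc=org
    changetype: modify
    delete: puppetClass
    puppetClass: role::deprecated::mediawiki::install
    EOF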
[22:50:44] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable logging for the Math extension (duration: 00m 29s)
[22:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:50:51] bd808: ^^
[22:51:06] awesome
[22:53:05] (03CR) 10Mobrovac: [C: 031] enable EventBus logging channel (currently only in beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259156 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans)
[22:56:40] (03PS1) 10Dzahn: openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919)
[22:57:31] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused
[22:57:38] (03CR) 10jenkins-bot: [V: 04-1] openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn)
[22:57:42] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[22:57:53] restbase1001 is me deploying ^^
[22:59:13] (03PS2) 10Dzahn: openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919)
[23:00:09] (03CR) 10jenkins-bot: [V: 04-1] openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn)
[23:01:19] (03PS1) 10JGirault: Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259176
[23:01:23] (03PS3) 10Dzahn: openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919)
[23:02:24] (03CR) 10jenkins-bot: [V: 04-1] openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919) (owner: 10Dzahn)
[23:06:02] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:09:26] (03PS4) 10Dzahn: openldap: add backup with ldif files [puppet] - 10https://gerrit.wikimedia.org/r/259174 (https://phabricator.wikimedia.org/T120919)
[23:11:37] 6operations, 5Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#1879471 (10Dzahn) I added backup::host to these nodes as a requirement. That installed bacula client etc already. Now pending the second patch i uploaded which adds a cron to run slapcat, a director...
[23:11:56] 6operations, 5Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#1879472 (10Dzahn) a:3Dzahn
[23:17:17] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10MediaWiki-extensions-CentralNotice, 10Traffic: Eventlogging should transparently split large event payloads - https://phabricator.wikimedia.org/T114078#1879499 (10Tgr) "Excessively" seems to be currently defined as "more than 500 characters" (t...
[23:17:55] bblack: you around?
[23:18:52] dr0ptp4kt_: yes
[23:19:02] bblack: can you hop on a hangout?
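The restbase1001 alerts above come from the standard endpoint checks probing the service locally; the URL and port are taken straight from the alert text, so a manual spot-check from the host itself looks roughly like:

    # Exit non-zero (and print nothing) unless the service answers.
    curl -fsS 'http://127.0.0.1:7231/en.wikipedia.org/v1/?spec' >/dev/null && echo healthy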
[23:19:13] probably
[23:19:31] bblack: will call you, if you didn't see it already
[23:23:22] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15184 bytes in 0.022 second response time
[23:23:32] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[23:34:56] !log restbase: starting deploy of 9f31847ad to restbase prod cluster
[23:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:36:55] (03CR) 10Aaron Schulz: [C: 032] Set a high but finite "maxjobs" default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258429 (owner: 10Aaron Schulz)
[23:38:49] (03Merged) 10jenkins-bot: Set a high but finite "maxjobs" default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258429 (owner: 10Aaron Schulz)
[23:38:57] (03PS1) 10EBernhardson: Enable cirrus completion suggester beta feature in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259187
[23:39:40] !log aaron@tin Synchronized rpc/RunJobs.php: 82f4f9df64e42fd04bd32395 (duration: 00m 29s)
[23:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:19] (03CR) 10EBernhardson: [C: 032] Enable cirrus completion suggester beta feature in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259187 (owner: 10EBernhardson)
[23:41:44] (03Merged) 10jenkins-bot: Enable cirrus completion suggester beta feature in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259187 (owner: 10EBernhardson)
[23:46:02] ostriches: I've got an emergency patch for wmf.8
[23:46:28] (03PS1) 10Yurik: delete GraphImgServiceAlways, add GraphEnableGZip [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259189
[23:47:08] bd808: can't wait for swat?
[23:47:25] yeah, 13 minutes won't change the world
[23:47:33] * bd808 puts it into swat
[23:47:51] !log restbase: finished deploy of 9f31847ad to restbase prod cluster
[23:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:48:12] bd808: thx. Figure it's easier to bundle there if it can stand the few mins wait.
[23:49:12] jouncebot: refresh
[23:49:14] I refreshed my knowledge about deployments.
[23:51:47] twentyafterfour, thcipriani: Are there any pitfalls I should be aware of while deploying https://gerrit.wikimedia.org/r/#/c/255135 (adds new git submodule to mediawiki-config)?
[23:52:52] Make sure to init :)
[23:53:29] I can't think of any gotchas wrt mediawiki-config submodules.
[23:54:15] RoanKattouw: i did put it in this morning's SWAT so releng could deploy it, but it ran overtime so got pushed back to evening
[23:54:26] OK
[23:54:50] ebernhardson: Yeah, I feel better about doing it having actually reviewed the submodule itself, and knowing it was coming up this time
[23:56:32] ostriches: is there anything special with the dual masters setup and config submodules?
[23:56:49] (racking my brain for anything that might be weird)
[23:57:25] It should just work
[23:57:33] Since we sync .git
[23:57:49] So our repo state should be consistent
[23:59:10] thcipriani: although I'm thinking of swapping for `git annex sync --content` to save checking a bazillion mtimes on .git objects
[23:59:51] in the brave new future of git-annex
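On ostriches' "Make sure to init" above: after pulling a mediawiki-config change that adds a submodule, the checkout needs the submodule initialized before syncing it out. A minimal sketch; the staging path is an assumption:

    cd /srv/mediawiki-staging
    git pull
    git submodule update --init --recursive  # fetch and check out the new submodule(s)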
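And on thcipriani's closing thought: `git annex sync --content` is git-annex's command for syncing branches and transferring annexed file content between repositories in one step, which is what would replace the mtime scan over .git objects he mentions. Usage is simply:

    git annex sync --content  # push/pull git branches and move annexed content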