[00:01:08] I !logged mw1222 as broken on may 7, may 6, april 10 ... [00:02:02] march 25, march 18, march 17. [00:02:11] uhhh that's a lot [00:02:31] I never bothered to open a ticket before, but !logged and moved on [00:03:04] 2 of the hosts that are broken as of today are scap rsync proxies and that is causing more brokenness [00:03:25] most of the time scap changes only effect the driver scripts on tin [00:03:35] but the newest changes need to go everywhere [00:05:37] (PS4) Rush: WIP: Setup a node pool from etcd for lvs cluster [puppet] - https://gerrit.wikimedia.org/r/219481 [00:05:42] there was some issue for deploy repo perms [00:05:44] iirc [00:06:13] operations, Deployment-Systems, Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390321 (fgiunchedi) so, audit time ``` # salt --output=txt mw2*.codfw.wmnet cmd.run 'find /srv/deployment/scap/scap/.git -size 0 -ls' mw206... [00:07:17] chasemp bd808 ^ [00:07:33] I don't see wikibugs [00:07:38] ah [00:08:15] chasemp: yeah I am unignoring bots now because of ^ [00:08:55] so a bunch from the trebuchet update today and a few from a long time ago [00:09:55] there may be something in the apache logs on tin from the ones that broke today [00:10:43] * bd808 can't read /var/log/apache2/tin.eqiad.wmnet_error.log [00:12:07] bd808: put a copy in tin you can read there isn't anything sensitve there [00:12:13] but it looks pretty interesting [00:13:18] godog: I have to go deal with kids here quick, is there anything specific I can do before I go? [00:13:26] my take is to redo the process on a node and see the outcome [00:14:27] operations, Deployment-Systems, Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390355 (fgiunchedi) tin logs for `mw2086` around that time ``` tin.eqiad.wmnet_access.log:2620:0:860:102:92b1:1cff:fe25:954d - - [22/Jun/201... 
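Editor's note: the audit quoted above uses `find ... -size 0 -ls` to spot empty files inside a `.git` directory. The same check can be sketched in Python, assuming only the standard library; the function name `find_zero_size_files` is made up for illustration and is not part of scap, trebuchet, or salt. Unlike `find`, this sketch only considers regular files, and a zero-byte file is a hint of an interrupted write rather than proof of corruption.

```python
import os
import tempfile


def find_zero_size_files(root):
    """Return sorted paths of regular files under `root` that are 0 bytes long."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore symlinks; only real, regular files are of interest here.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # Small self-contained demo on a throwaway directory (hypothetical layout).
    with tempfile.TemporaryDirectory() as demo_root:
        objects = os.path.join(demo_root, "objects")
        os.makedirs(objects)
        open(os.path.join(objects, "truncated"), "w").close()
        with open(os.path.join(objects, "intact"), "w") as handle:
            handle.write("data")
        for path in find_zero_size_files(demo_root):
            print(path)
```

Pointed at a checkout's `.git` directory, the returned list gives the candidates to inspect before deciding whether to repair or re-clone.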
[00:14:51] chasemp: nope go ahead! [00:19:34] bd808: I can't find anything obvious, perhaps git interrupted in flight? [00:20:15] bd808: anyways I'll go ahead and remove those I think [00:20:24] cool [00:26:05] 6operations, 10Deployment-Systems, 6Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390399 (10fgiunchedi) I've removed the zero-size files and ran `deploy.fetch` + `deploy.checkout` on those machines [00:30:03] godog: mw2080 still doesn't have all the files that should be in /srv/deployment/scap/scap. Specifically scap.cfg is missing there [00:32:16] sigh [00:32:43] mw2187 seems to have the same problem [00:32:56] but git log shows all of the commits that I expect [00:33:29] I could try a no-op trebuchet deploy to see what happens [00:33:48] yep worth a try, I guess that'll refresh the deployment tag [00:34:16] k. I'll see what happens [00:36:07] Did virt1000 get decommed today? [00:37:48] mhh I don't think so, I can't ssh tho [00:39:43] (03CR) 10Krinkle: "Yeah, perhaps make that the default. Or group those results after a few lines of whitespace under a section "Private:". that way you get t" [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [00:40:04] 6operations, 10Deployment-Systems, 6Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390448 (10bd808) Some spot checked servers were still missing files in the local checkouts following @fgiunchedi's forced updates. I did a no-o... [00:40:17] godog: 20 hosts failed to checkout after fetching [00:44:11] 6operations, 10Deployment-Systems, 6Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390469 (10fgiunchedi) running `deploy.checkout` manually on `mw2197` yields a few sha1 files missing ``` mw2197:~$ sudo salt-call deploy.check... 
[00:45:08] bd808: see my last update, I'm thinking we're better off deleting /srv/deployment/scap/scap at this point [00:45:33] works for me. cleaning up a corrupt git clone is always a pain [00:46:12] operations, Labs, wikitech.wikimedia.org: Expand list of people who can create new Labs project - https://phabricator.wikimedia.org/T101688#1390474 (Legoktm) Do we currently have an issue with projects not being created in a timely manner? [00:48:46] bd808: yeah I've wiped the scap directory and ran fetch/checkout again, no machine failed afaict [00:50:28] operations, Deployment-Systems, Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390476 (fgiunchedi) that didn't work as we expected, I've removed `/srv/deployment/scap` from the affected machines and ran `deploy.fetch` +... [00:54:12] bd808: how did it go otherwise? [00:54:23] or have you been unable to try it? [00:55:40] ori: I haven't tried yet. I wanted the other problems out of the way first [00:55:50] *nod* makes sense [00:56:00] * ori was just quickly skimming the backlog after looking away from IRC [00:56:23] godog: w00t only virt1000 failed and "!log shutting down virt1000" is in backscroll [00:57:13] bd808: \o/ but also https://static.spiceworks.com/shared/post/0008/2713/roy.jpg [00:57:30] operations, Deployment-Systems, Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1390495 (bd808) Open>Resolved a:bd808 A third trebuchet run showed only virt1000 failing and SAL lists it as shut down as of today. 
[00:57:53] godog: I'll wear that shirt tomorrow for good luck :) [00:58:15] I <3 my Mao RTFM shirt [00:58:20] hehehe sadly I have mine at home in Dublin [01:00:41] !log Pruned virt1000 from trebuchet minions list: redis-cli srem "deploy:scap/scap:minions" virt1000.wikimedia.org [01:00:46] Logged the message, Master [01:03:07] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [01:12:11] (03PS1) 10Krinkle: Add static/ to docroot/wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220029 [01:13:20] ori: static/ is now cached agnostic from the hostname, right? [01:13:31] but in cache miss case, it will fetch it from the specified host. [01:13:57] that would explain why some static/ images are broken on www.wikimedia.org/static/ [01:14:00] since that directory doesn't exist [01:14:58] (03CR) 10Krinkle: [C: 032] Add static/ to docroot/wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220029 (owner: 10Krinkle) [01:15:05] (03Merged) 10jenkins-bot: Add static/ to docroot/wwwportal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220029 (owner: 10Krinkle) [01:16:02] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1390541 (10Krinkle) [01:17:23] (03CR) 10Filippo Giunchedi: [C: 04-1] Use cronolog and logrotate to avoid Puppetmaster Apache reloads (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/219788 (owner: 10Ori.livneh) [01:17:41] !log krinkle Synchronized docroot and w: (no message) (duration: 00m 12s) [01:17:46] Logged the message, Master [01:19:22] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1390550 (10Krinkle) > https://www.wikimedia.org/static/images/project-logos/enwikiversity.png 
Failed to load resource: the server... [01:19:31] operations, Release-Engineering, Wikidata, Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1390554 (Krinkle) p:Triage>High a:Krinkle [01:19:55] ori: https://gerrit.wikimedia.org/r/#/c/220023/1 time for this? [01:21:55] operations, Release-Engineering, Wikidata, Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1390555 (Krinkle) Open>Resolved [01:30:57] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [01:32:58] (CR) Ori.livneh: puppetmaster: split frontend scripts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/220023 (owner: Filippo Giunchedi) [01:38:02] operations, Traffic, HTTPS-by-default, Patch-For-Review, Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1390575 (fgiunchedi) looks like the last ipvsadm release is 1.28 from feb 2015 at https://www.kernel.org/pub/linux/utils/kernel/ipvsadm/ not y... 
[01:44:43] (03PS2) 10Filippo Giunchedi: puppetmaster: split frontend scripts [puppet] - 10https://gerrit.wikimedia.org/r/220023 [01:47:17] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:48:00] (03PS3) 10Filippo Giunchedi: puppetmaster: split frontend scripts [puppet] - 10https://gerrit.wikimedia.org/r/220023 [01:55:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [01:59:28] PROBLEM - configured eth on restbase1008 is CRITICAL: Connection refused by host [01:59:37] PROBLEM - RAID on restbase1008 is CRITICAL: Connection refused by host [01:59:38] PROBLEM - Disk space on restbase1008 is CRITICAL: Connection refused by host [01:59:48] PROBLEM - dhclient process on restbase1008 is CRITICAL: Connection refused by host [01:59:58] PROBLEM - salt-minion processes on restbase1008 is CRITICAL: Connection refused by host [02:00:07] PROBLEM - puppet last run on restbase1008 is CRITICAL: Connection refused by host [02:00:27] PROBLEM - NTP on restbase1008 is CRITICAL: NTP CRITICAL: No response from NTP server [02:00:38] PROBLEM - puppet last run on restbase1009 is CRITICAL Puppet last ran 1 day ago [02:00:47] PROBLEM - DPKG on restbase1008 is CRITICAL: Connection refused by host [02:01:09] PROBLEM - RAID on restbase1009 is CRITICAL Active: 5, Working: 5, Failed: 1, Spare: 0 [02:23:11] !log l10nupdate Synchronized php-1.26wmf10/cache/l10n: (no message) (duration: 06m 47s) [02:23:18] Logged the message, Master [02:26:45] !log LocalisationUpdate completed (1.26wmf10) at 2015-06-23 02:26:44+00:00 [02:26:49] Logged the message, Master [02:38:02] (03CR) 10Hoo man: "> If there's already an rdf in the directory, it seems we regenerate it. Do we want that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [03:33:35] !log xtrabackup clone db2023 to db1045 [03:33:40] Logged the message, Master [03:41:48] PROBLEM - puppet last run on analytics1012 is CRITICAL Puppet has 1 failures [03:56:08] RECOVERY - puppet last run on analytics1012 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:10:13] akosiaris: somehow, https://gerrit.wikimedia.org/r/#/c/219781/3 isn't reflecting. I restarted cxserver instances too. [04:10:52] akosiaris: should be like, https://cxserver-beta.wmflabs.org/list/mt/ur/hi (note: 'no-mt'), but it is, https://cxserver.wikimedia.org/list/mt/ur/hi :/ [04:32:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [04:33:38] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [04:34:13] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review, 7Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1390862 (10BBlack) I didn't even see the 1.28 package hiding out on kernel.org! It does have the appropriate flag in it. Perhaps we can put th... [04:39:44] (03PS1) 10Cscott: Set $wgVisualEditorParsoidDomain for Parsoid v2 API. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220061 [04:40:48] (03PS1) 10KartikMistry: CX: Fix indent for defaults [puppet] - 10https://gerrit.wikimedia.org/r/220063 [04:41:25] (03PS2) 10Cscott: Set $wgVisualEditorParsoidDomain for Parsoid v2 API. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/220061 [04:41:57] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [04:49:31] (03PS1) 10Cscott: Labs should not use protocol-relative URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220068 [04:53:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 23 04:53:17 UTC 2015 (duration 53m 16s) [04:53:22] Logged the message, Master [05:01:23] (03PS1) 10KartikMistry: Enable 'frwiki-recommender' campaign in frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220071 (https://phabricator.wikimedia.org/T101944) [05:16:32] (03CR) 10Santhosh: [C: 031] Enable 'frwiki-recommender' campaign in frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220071 (https://phabricator.wikimedia.org/T101944) (owner: 10KartikMistry) [05:23:33] 6operations, 6Performance-Team, 10Traffic, 5Patch-For-Review, 7Varnish: Ensure {text-domain}/w/load.php requests do not bypass cache for session cookies - https://phabricator.wikimedia.org/T101892#1390925 (10Krinkle) 5Open>3Resolved a:3Krinkle [05:26:30] 6operations, 6Analytics-Engineering: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1390932 (10Ironholds) A note that this has now been fixed manually by (I think) Roan for one file but not for both, so the dashboards are still broken. I'd really... 
[05:42:49] operations, Traffic, HTTPS, Security-General: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298#1390977 (Krinkle) [05:43:32] operations, Traffic, HTTPS, Security-General: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298#1105632 (Krinkle) (Moved to Security-General, because its publicly visible and not related to MediaWiki core) [05:44:10] operations, Labs, Labs-Infrastructure, OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1390982 (cscott) NEW [05:46:28] operations, Labs, Labs-Infrastructure, OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1390993 (cscott) Oh, I forgot to add: ``` cscott@deployment-pdf02:/srv/deployment/ocg/ocg/mw-ocg-service$ git log commit 2b2816... [06:22:03] kart_: I see the config has registry: {"defaults": {"hi-ur": "no-mt", "ur-hi": "no-mt"} on both servers so it's been populated correctly [06:22:19] kart_: perhaps beta runs a version that supports that and production doesn't ? [06:29:48] PROBLEM - puppet last run on mw2059 is CRITICAL puppet fail [06:30:31] operations, Wikimedia-Etherpad, Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391063 (akosiaris) Elijah Sparrow on https://github.com/ether/etherpad-lite/issues/2522#issuecomment-114310039 reported that the utf8mb4_bin change fixed... 
[06:30:39] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on mc2011 is CRITICAL Puppet has 1 failures [06:31:48] PROBLEM - puppet last run on etcd1003 is CRITICAL Puppet has 1 failures [06:32:15] <_joe_> akosiaris: good morning sir [06:32:43] _joe_: hey [06:34:40] <_joe_> so, regarding lvs [06:34:48] PROBLEM - puppet last run on db1040 is CRITICAL Puppet has 1 failures [06:34:52] PROBLEM - puppet last run on db1002 is CRITICAL Puppet has 1 failures [06:34:52] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures [06:34:52] <_joe_> I think me and chase are mostly done [06:35:08] PROBLEM - puppet last run on db1018 is CRITICAL Puppet has 1 failures [06:35:08] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:35:19] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [06:35:38] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures [06:35:58] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:36:25] _joe_: nice! 
[06:36:38] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures [06:36:38] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [06:36:38] PROBLEM - puppet last run on mw1175 is CRITICAL Puppet has 1 failures [06:36:38] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:36:48] PROBLEM - puppet last run on mw1226 is CRITICAL Puppet has 1 failures [06:37:18] PROBLEM - puppet last run on mw1176 is CRITICAL Puppet has 1 failures [06:37:27] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:37:28] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:37:28] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures [06:37:28] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [06:37:35] <_joe_> meaning, we have one patch pending that you should consider in order not to step on our toes - https://gerrit.wikimedia.org/r/#/c/219481/ [06:37:37] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [06:37:49] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 1 failures [06:38:28] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 1 failures [06:38:29] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:39:21] (03PS1) 10Hydriz: Fix URL to interwiki cache on noc.wikimedia.org [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/220075 [06:45:28] RECOVERY - puppet last run on db1002 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:57] RECOVERY - puppet last run on mw1226 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:58] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently 
enabled, last run 9 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on etcd1003 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on mw1176 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:27] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:37] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:47] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:47:12] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391105 (10jcrespo) >>! In T103417#1391063, @akosiaris wrote: > Elijah Sparrow on https://github.com/ether/etherpad-lite/issues/2522#issuecomment-114310039... 
[06:47:18] RECOVERY - puppet last run on db1040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:18] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:18] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on mw1175 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:47:28] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:28] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on db1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:37] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:57] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:09] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:18] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:18] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:08] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:28] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:09] PROBLEM - Router interfaces 
on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [06:57:27] operations, Analytics-Engineering: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1391121 (Catrope) >>! In T103423#1390932, @Ironholds wrote: > A note that this has now been fixed manually by (I think) Roan for one file but not for both, so th... [06:58:38] ! log added jsch_0.1.50-1ubuntu1~wmfprecise1 to precise-wikimedia on carbon [06:59:55] (PS1) KartikMistry: CX: Enable CX as default expect where it is not deployed [mediawiki-config] - https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) [07:00:01] (CR) jenkins-bot: [V: -1] CX: Enable CX as default expect where it is not deployed [mediawiki-config] - https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: KartikMistry) [07:02:34] ! log updated jsch on gallium and lanthanum to support modern SSH key exchange in Jenkins (actually that happened yesterday, but I forgot to log it back then) [07:02:51] logbot seems dead? 
[07:03:48] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [07:05:22] (03PS2) 10KartikMistry: CX: Enable CX as default expect where it is not deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) [07:05:27] 6operations, 7Availability, 7Varnish: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1391201 (10Krinkle) [07:07:39] (03PS3) 10KartikMistry: CX: Enable CX as default except where it is not deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) [07:26:40] (03PS1) 10Giuseppe Lavagetto: role::conftool::master: introduction, apply on palladium [puppet] - 10https://gerrit.wikimedia.org/r/220080 [07:27:55] (03CR) 10Alexandros Kosiaris: [C: 031] racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) (owner: 10Filippo Giunchedi) [07:27:58] (03CR) 10Giuseppe Lavagetto: [C: 032] role::conftool::master: introduction, apply on palladium [puppet] - 10https://gerrit.wikimedia.org/r/220080 (owner: 10Giuseppe Lavagetto) [07:35:08] PROBLEM - puppet last run on palladium is CRITICAL Puppet has 1 failures [07:36:22] <_joe_> this is me ^^ nothing serious though [07:40:01] akosiaris, running it on a slave for testing, it is better than I thought- only 10 minutes ETA. 
will keep you updated [07:40:22] (03PS1) 10Giuseppe Lavagetto: conftool: fix ssl_dir [puppet] - 10https://gerrit.wikimedia.org/r/220082 [07:40:48] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1391234 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [07:40:49] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: fix ssl_dir [puppet] - 10https://gerrit.wikimedia.org/r/220082 (owner: 10Giuseppe Lavagetto) [07:41:00] (03CR) 10Giuseppe Lavagetto: [V: 032] conftool: fix ssl_dir [puppet] - 10https://gerrit.wikimedia.org/r/220082 (owner: 10Giuseppe Lavagetto) [07:41:05] jynus: wow, that's fast. Nice!. Thanks [07:42:27] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:42:52] http://etherpad.wikimedia.org/p/Legends_of_Lavanya [07:43:00] I realized that every column only contains edits, not etherpads [07:43:13] seems like our etherpad installation is where all RPGers meet up [07:43:20] so it is posible to purge/archive them (it has a timestamp) [07:44:04] jynus: every column ? there are only 2 columns. [07:44:18] sorry, every row [07:44:38] ah, yes [07:44:44] well, not only edits [07:44:56] it's highly dependent on the key column [07:45:12] but the majority should indeed be edits [07:45:17] (03PS1) 10Giuseppe Lavagetto: conftool: fix variable name [puppet] - 10https://gerrit.wikimedia.org/r/220083 [07:45:28] <_joe_> sigh. [07:46:04] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: fix variable name [puppet] - 10https://gerrit.wikimedia.org/r/220083 (owner: 10Giuseppe Lavagetto) [07:47:44] akosiaris: you have the ruby-foo ? 
[07:47:53] need advise on https://gerrit.wikimedia.org/r/#/c/219187/1/modules/statsdlb/manifests/init.pp [07:49:46] matanya: the original version is forcing type casting to string [07:50:20] the reason is that the first argument of validate_re() needs to be a string [07:51:05] or at least that's how I am reading godog's comment [07:51:35] i spoke to him, and he said he is not sure if the behaviour changes [07:51:57] and i should find someone with ruby-foo to comment on that [07:52:12] it makes sense, but i am not sure where to go from here [07:52:14] I am not yet either. [07:52:16] still looking [07:55:11] (03PS2) 10KartikMistry: CX: Fix indent for defaults [puppet] - 10https://gerrit.wikimedia.org/r/220063 [07:55:38] akosiaris: https://gerrit.wikimedia.org/r/#/c/220063 seems okay? [07:58:53] <_joe_> ugh that hiera file is *ugly* [07:59:05] <_joe_> I'd use the [] syntax for such long arrays [07:59:14] <_joe_> that file is unreadable to the human eye [07:59:38] matanya: I 'll comment on that change [07:59:46] thank you [08:02:52] kart_: I honestly do not know. I have no idea what the structure cxserver is parsing. Syntactically, it's correct. you do move default under mt but that is about it [08:03:57] kart_: semantically, mt['defaults']['hi-ur'] having the value 'no-mt' is counter intuitive but if that's your decision, I am fine with it [08:08:10] akosiaris, online schema change fails due to the schema [08:08:17] we may need maintenance [08:08:28] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:09:37] jynus: sigh... [08:10:00] let me check how much on the slaves first [08:10:02] well, that should be doable [08:10:02] kart_: so, it's your call for https://gerrit.wikimedia.org/r/#/c/220063/2 [08:10:09] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [08:10:09] should I merge or not ? 
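Editor's note: the exchange above about `mt['defaults']['hi-ur']` having the value `'no-mt'` describes a per-language-pair default lookup. A minimal sketch of that lookup follows, assuming the structure is a plain mapping keyed by `"source-target"` as quoted in the log. The function name `default_mt_provider` and the `fallback` parameter are hypothetical; this is not cxserver's code, which is not shown anywhere in the log.

```python
def default_mt_provider(mt_config, source, target, fallback=None):
    """Return the default MT provider configured for a language pair.

    `mt_config` is assumed to look like the structure quoted in the log:
    {"defaults": {"hi-ur": "no-mt", "ur-hi": "no-mt"}}.
    Pairs without an entry return `fallback`.
    """
    defaults = mt_config.get("defaults", {})
    return defaults.get(f"{source}-{target}", fallback)


# Values copied from the log; everything else here is illustrative.
mt = {"defaults": {"hi-ur": "no-mt", "ur-hi": "no-mt"}}

if __name__ == "__main__":
    print(default_mt_provider(mt, "hi", "ur"))
    print(default_mt_provider(mt, "en", "es"))
```

Under this reading, `'no-mt'` acts as a sentinel meaning "do not preselect machine translation for this pair", which matches the later remark that users can still choose to use it.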
[08:10:54] we could do a failover, but db1001 has more things, I would prefer to set etherpad as read-only [08:11:27] (03PS1) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [08:11:29] (03PS1) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 [08:11:31] (03PS1) 10Alexandros Kosiaris: etherpad: Log the incoming Request original IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [08:12:28] (03PS6) 10Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) [08:18:16] we can do something else, change the configuration temporarelly from m1-master to m1-slave [08:18:26] and work with the failover [08:19:43] 6operations, 5Patch-For-Review: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1391297 (10akosiaris) A full catalog compilation on ruby1.9 showed no other failures. Seems like our puppet tree is finally ruby 1.9 compliant. W... [08:22:32] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1391306 (10MoritzMuehlenhoff) All our 3.2 kernels (and also sodium's 2.6.32) have the livelock fix (6b43ae8a619d17c4935c3320d2ef9e92bdeed05d). The system with the oldest kernel (es1006) was really... [08:23:46] (03PS2) 10Alexandros Kosiaris: etherpad: Log the incoming Requests original IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [08:24:04] (03PS3) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [08:28:15] akosiaris: please merge. It is no-mt as defaults. User can choose if they want to use or not. 
[08:29:00] (CR) Alexandros Kosiaris: [C: 2] CX: Fix indent for defaults [puppet] - https://gerrit.wikimedia.org/r/220063 (owner: KartikMistry) [08:34:35] operations, MediaWiki-Sites, SEO, HTTPS-by-default, and 4 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1391347 (Nemo_bis) Getting quite serious currently, https://www.google.it/search?q="wikipedia.org/%... [08:36:22] (CR) Alexandros Kosiaris: [C: -1] statsdlb: minor lint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/219187 (owner: Matanya) [08:37:05] (PS1) Giuseppe Lavagetto: conftool: separate master role out [puppet] - https://gerrit.wikimedia.org/r/220090 [08:38:16] operations, MediaWiki-Sites, SEO, HTTPS-by-default, and 4 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#696646 (Nemo_bis) [08:38:56] (PS2) Giuseppe Lavagetto: conftool: separate master role out [puppet] - https://gerrit.wikimedia.org/r/220090 [08:39:09] operations, MediaWiki-Sites, SEO, HTTPS-by-default, and 4 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#696646 (Nemo_bis) [08:41:05] akosiaris: thanks. It looks good now. 
[08:41:59] apergos: https://phabricator.wikimedia.org/T102039 in case you haven't seen cscott's latest update [08:42:09] <_joe_> Nemo_bis: I can assure you it's being taken seriously [08:46:04] operations, discovery-system: confctl fails if only one data is set - https://phabricator.wikimedia.org/T103481#1391401 (Joe) NEW [08:46:16] operations, discovery-system: confctl fails if only one data is set - https://phabricator.wikimedia.org/T103481#1391408 (Joe) a:Joe [08:47:47] akosiaris: I talked to him last night about it [08:47:51] thanks though [08:47:53] (CR) Giuseppe Lavagetto: [C: 2] conftool: separate master role out [puppet] - https://gerrit.wikimedia.org/r/220090 (owner: Giuseppe Lavagetto) [08:48:28] operations, Mathoid, RESTBase, Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1391414 (mobrovac) [08:48:31] operations, Mathoid, MediaWiki-Vagrant, Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1391411 (mobrovac) Open>Resolved [08:49:46] operations, MediaWiki-Sites, SEO, HTTPS-by-default, and 4 others: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#1391426 (Nemo_bis) [08:51:11] operations, discovery-system: conftool-syncer is too slow in production - https://phabricator.wikimedia.org/T103482#1391432 (Joe) NEW a:Joe [08:57:17] PROBLEM - puppet last run on mw2021 is CRITICAL puppet fail [08:59:28] PROBLEM - RAID on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:48] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:04:18] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [09:04:57] RECOVERY - RAID on analytics1020 is OK no disks configured for RAID [09:07:35] (03PS4) 10Hashar: CX: Enable CX as default except where it is not deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [09:07:47] (03CR) 10Hashar: "Removed link to T103322 , unrelated :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [09:08:01] 6operations, 10Continuous-Integration-Infrastructure: Jessie does not have libvips15 - https://phabricator.wikimedia.org/T103322#1391497 (10hashar) [09:09:33] !log failing over etherpad to db1016 [09:09:38] Logged the message, Master [09:09:48] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [09:09:59] known ^ [09:10:06] (03PS1) 10Giuseppe Lavagetto: role::conftool::master: fix puppet scoping (!!!) [puppet] - 10https://gerrit.wikimedia.org/r/220092 [09:11:08] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [09:11:25] (03CR) 10Giuseppe Lavagetto: [C: 032] role::conftool::master: fix puppet scoping (!!!) [puppet] - 10https://gerrit.wikimedia.org/r/220092 (owner: 10Giuseppe Lavagetto) [09:11:36] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1391521 (10MoritzMuehlenhoff) The Java fix is present in all our openjdk-7 packages and the few systems with a openjdk-8 backport. This covers the complex services like Hadoop, Cassandra and Elast... 
[09:12:13] ACKNOWLEDGEMENT - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time alexandros kosiaris failover to d1016 [09:12:28] ACKNOWLEDGEMENT - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js alexandros kosiaris failover to db1016 [09:12:58] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [09:17:09] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:18:18] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [09:18:48] hashar: Thanks :) [09:20:48] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 7926 bytes in 0.017 second response time [09:21:05] kart_: you are welcome. While you are around you have a task to setup some new wikis on the beta cluster. I am wondering whether we can mark it closed or if you need something else to be done ( wikis for CX https://phabricator.wikimedia.org/T90683 ) [09:21:35] hashar: I'm going to relook it. We certainly need it. [09:21:45] hashar: meanwhile, https://phabricator.wikimedia.org/T103486 :) [09:22:07] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [09:22:16] hashar: Setup new wiki is low priority, but surely needed. [09:22:37] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:25:50] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service - https://phabricator.wikimedia.org/T102281#1391598 (10hashar) Nodepool creates new images using python-diskimage-builder. Turns out that script rely on having root access and the nodepool-puppet manifests have... [09:26:08] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [09:26:23] kart_: can you post a quick status on https://phabricator.wikimedia.org/T90683 and maybe lower the priority ? That one keep coming in during our triages :-] [09:27:17] PROBLEM - puppet last run on etherpad1001 is CRITICAL puppet fail [09:29:07] RECOVERY - puppet last run on etherpad1001 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:32:27] 6operations, 10Continuous-Integration-Infrastructure: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1391656 (10hashar) 3NEW [09:36:48] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:16] ! 
log added jsch_0.1.50-1ubuntu1~wmfprecise1 to precise-wikimedia on carbon [09:41:43] !log added jsch_0.1.50-1ubuntu1~wmfprecise1 to precise-wikimedia on carbon [09:41:47] Logged the message, Master [09:41:55] !log updated jsch on gallium and lanthanum to support modern SSH key exchange in Jenkins (actually that happened yesterday, but I forgot to log it back then) [09:41:59] Logged the message, Master [09:43:53] (03PS1) 10KartikMistry: CX: Add wikis for deployment on 20150623 [puppet] - 10https://gerrit.wikimedia.org/r/220095 (https://phabricator.wikimedia.org/T103316) [09:46:59] 6operations, 10Continuous-Integration-Infrastructure: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1391724 (10hashar) Here are the Jenkins jobs JDK from `ssh gallium.wikimedia.org grep jdk /var/lib/jenkins/jobs/*/config.xml` | Job name | Jenkins XML config |--|-- | analytic... [09:47:00] (03CR) 10Phuedx: [C: 04-1] "What Krinkle said. Also, like the WikiGrok config, should we store the static data in mobile.php and just have the feature flags in Initia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [10:07:24] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391756 (10jcrespo) I'm running on the passive node each time: ``` ALTER TABLE etherpadlite.store CHANGE `key` `key` varchar(100) CHARSET utf8mb4 COLLATE u... [10:08:58] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [10:10:31] 6operations, 10Continuous-Integration-Infrastructure: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1391763 (10hashar) The Jenkins main configuration file has: ``` lang=xml Ubuntu - OpenJdk 6 /usr/lib/jvm/java-6-openjdk-amd64/... 
[10:10:55] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391765 (10jcrespo) I had to also allow `slave_type_conversions = ALL_NON_LOSSY` on the slave. [10:14:11] (03PS1) 10Hashar: contint: no more install openjdk-6 [puppet] - 10https://gerrit.wikimedia.org/r/220098 (https://phabricator.wikimedia.org/T103491) [10:16:28] PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail [10:17:18] (03CR) 10Alexandros Kosiaris: "ping? any news ?" [puppet] - 10https://gerrit.wikimedia.org/r/219134 (owner: 10ArielGlenn) [10:26:50] (03PS1) 10Giuseppe Lavagetto: action: parse arguments correctly, warn if incorrect [software/conftool] - 10https://gerrit.wikimedia.org/r/220099 (https://phabricator.wikimedia.org/T103481) [10:26:53] (03PS1) 10Giuseppe Lavagetto: syncer: performance improvements [software/conftool] - 10https://gerrit.wikimedia.org/r/220100 (https://phabricator.wikimedia.org/T103482) [10:28:43] (03PS2) 10Hashar: contint: no more install openjdk-6 [puppet] - 10https://gerrit.wikimedia.org/r/220098 (https://phabricator.wikimedia.org/T103491) [10:29:40] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391812 (10jcrespo) I want to run ` pt-table-sync` on that table before call it done. I am not 100% sure all changes went though from the master to the slav... 
[10:34:19] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:42:33] (03PS2) 10Giuseppe Lavagetto: syncer: performance improvements [software/conftool] - 10https://gerrit.wikimedia.org/r/220100 (https://phabricator.wikimedia.org/T103482) [10:43:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] action: parse arguments correctly, warn if incorrect [software/conftool] - 10https://gerrit.wikimedia.org/r/220099 (https://phabricator.wikimedia.org/T103481) (owner: 10Giuseppe Lavagetto) [10:43:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] syncer: performance improvements [software/conftool] - 10https://gerrit.wikimedia.org/r/220100 (https://phabricator.wikimedia.org/T103482) (owner: 10Giuseppe Lavagetto) [10:52:29] (03PS1) 10Giuseppe Lavagetto: debian: Version bump to 0.1.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/220110 [10:54:48] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] debian: Version bump to 0.1.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/220110 (owner: 10Giuseppe Lavagetto) [10:59:23] (03PS4) 10Yuvipanda: ssh: Extend all the cipher goodies to precise as well [puppet] - 10https://gerrit.wikimedia.org/r/218411 (https://phabricator.wikimedia.org/T102401) [11:00:05] (03PS5) 10Yuvipanda: ssh: Extend all the cipher goodies to precise as well [puppet] - 10https://gerrit.wikimedia.org/r/218411 (https://phabricator.wikimedia.org/T102401) [11:00:07] (03PS3) 10Yuvipanda: labs: Update to newest openssh-server only on labs precise instances [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) [11:01:45] (03PS2) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 [11:01:47] (03PS4) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [11:01:49] (03PS2) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 
10https://gerrit.wikimedia.org/r/220085 [11:06:43] (03PS1) 10Yuvipanda: quarry: Add query killer role [puppet] - 10https://gerrit.wikimedia.org/r/220111 [11:09:02] <_joe_> YuviPanda: you should refrain from using "kill" too much you know? [11:09:03] <_joe_> :P [11:09:14] _joe_: heh :) [11:09:25] _joe_: if quarry dies later this week, I'll know :) [11:10:20] (03PS2) 10Yuvipanda: quarry: Add query killer role [puppet] - 10https://gerrit.wikimedia.org/r/220111 [11:10:27] (03CR) 10Yuvipanda: [C: 032 V: 032] quarry: Add query killer role [puppet] - 10https://gerrit.wikimedia.org/r/220111 (owner: 10Yuvipanda) [11:13:08] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391858 (10jcrespo) Integrity has been checked on the slaves, closing as the scope of this ticket has been fulfilled. [11:13:18] 6operations, 10Wikimedia-Etherpad, 7Database: Change character set on etherpad- lite database to utf8mb4_bin - https://phabricator.wikimedia.org/T103417#1391859 (10jcrespo) 5Open>3Resolved [11:19:28] (03PS1) 10Jcrespo: Repool es1002, depool es1003 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220116 [11:24:23] (03CR) 10Jcrespo: [C: 032] Repool es1002, depool es1003 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220116 (owner: 10Jcrespo) [11:25:40] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1002, depool es1003 (duration: 00m 12s) [11:25:44] Logged the message, Master [11:28:30] (03CR) 10Muehlenhoff: labs: Update to newest openssh-server only on labs precise instances (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) (owner: 10Yuvipanda) [11:33:49] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1002, depool es1003 (part 2/2) (duration: 00m 12s) [11:33:53] Logged the message, Master [11:35:14] (03CR) 10Yuvipanda: labs: Update to newest 
openssh-server only on labs precise instances (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) (owner: 10Yuvipanda) [11:41:53] (03PS4) 10Yuvipanda: labs: Update to newest openssh-server only on labs precise instances [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) [11:41:59] moritzm: ^ [11:43:14] cleaner too! [11:44:07] 6operations, 6Labs, 10Labs-Infrastructure, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1391917 (10Krenair) > Probably some sort of permissions problem on pdf02? I don't have root on the ocg machines, so I can't fix i... [11:47:51] (03CR) 10Muehlenhoff: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) (owner: 10Yuvipanda) [11:50:36] (03PS5) 10Yuvipanda: labs: Update to newest openssh-server only on labs precise instances [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) [11:50:55] (03CR) 10Yuvipanda: [C: 032] labs: Update to newest openssh-server only on labs precise instances [puppet] - 10https://gerrit.wikimedia.org/r/218627 (https://phabricator.wikimedia.org/T102401) (owner: 10Yuvipanda) [11:54:51] (03PS5) 10Alex Monk: Remove dependency on echowikis.dblist [puppet] - 10https://gerrit.wikimedia.org/r/139581 (https://phabricator.wikimedia.org/T59375) (owner: 10Withoutaname) [11:55:02] (03PS6) 10Alex Monk: Remove dependency on echowikis.dblist [puppet] - 10https://gerrit.wikimedia.org/r/139581 (https://phabricator.wikimedia.org/T59375) (owner: 10Withoutaname) [11:56:43] (03CR) 10Alex Monk: "Yuvi, Andrew, Coren: Is this OK with you guys? If so I'd like to get it done rather than sit in the config queue." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [11:57:13] (03CR) 10Alex Monk: "Bump... Andrew, this needs your +2." [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [12:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150623T1200). Please do the needful. [12:01:58] wait what? [12:02:25] Oh arbitrary access for ruwiki and cswiki [12:02:40] My calendar says I'll do that tonight [12:03:05] Uhm... I guess I'll just do it now, then [12:03:53] (03PS2) 10Alex Monk: Separate private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/214037 [12:06:10] (03PS3) 10Alex Monk: Separate private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/214037 [12:07:22] (03PS4) 10Alex Monk: Separate private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/214037 [12:07:56] (03PS1) 10Hoo man: Arbitrary access for ruwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220120 (https://phabricator.wikimedia.org/T102122) [12:08:05] hoo: you doing that? 
[12:08:18] Yes, I'll do it now, rather than tonight [12:08:24] ok :) [12:08:28] Also Jan will deploy the quality stuff on his own [12:08:31] (03PS1) 10Prtksxna: Remove $wgPopupsSurveyLink as trial is complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220121 (https://phabricator.wikimedia.org/T103283) [12:08:32] k [12:08:32] earlier than planned [12:08:52] (03CR) 10Hoo man: [C: 032] Arbitrary access for ruwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220120 (https://phabricator.wikimedia.org/T102122) (owner: 10Hoo man) [12:08:58] (03Merged) 10jenkins-bot: Arbitrary access for ruwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220120 (https://phabricator.wikimedia.org/T102122) (owner: 10Hoo man) [12:10:02] !log hoo Synchronized arbitraryaccess.dblist: Arbitrary access for ruwiki and cswiki. T102122 (duration: 00m 12s) [12:10:06] Logged the message, Master [12:12:12] print( mw.wikibase.getEntity( 'Q42' ):getLabel( 'en' )) [12:12:12] Douglas Adams [12:12:13] :) [12:15:23] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Error from DataBinding 'hiera' while looking up 'puppetmaster::logstash::timeout': undefined method `empty?' for nil:NilClass on node deployment-salt.deployment-prep.eqiad.wmflabs [12:15:27] in deployment-salt [12:15:28] :'( [12:16:32] hahaha [12:16:38] fuck you too, deployment-prep [12:16:46] > Error: Failed to apply catalog: Could not find dependency Class[Role::Access_new_install] for File[/usr/local/sbin/install-console] at /etc/puppet/modules/puppetmaster/manifests/scripts.pp:62 [12:16:48] that was fixed earlier [12:16:57] and I'm intermittently getting one or the other [12:17:21] * YuviPanda switches to a different project to do testing [12:17:28] moritzm: did you upload packages to carbon yet? 
[12:18:17] !log uploaded openssh_6.6p1-2ubuntu2~wmfprecise2 to precise-wikimedia on apt.wikimedia.org [12:18:20] Logged the message, Master [12:18:27] YuviPanda: see above :-) [12:21:31] ah cool :D [12:21:36] am trying on tools-precise-dev [12:21:50] note that the binary packages have been slightly reorganised compared to precise [12:22:40] openssh-sftp-server is a new binary package built from src:openssh [12:23:10] 6operations: puppetmaster self: Could not find dependency Class[Role::Access_new_install] for File[/usr/local/sbin/install-console] at /etc/puppet/modules/puppetmaster/manifests/scripts.pp:62 - https://phabricator.wikimedia.org/T103499#1392004 (10hashar) 3NEW [12:23:23] (03CR) 10Hashar: "godog: that breaks puppet master self :( T103499" [puppet] - 10https://gerrit.wikimedia.org/r/217016 (owner: 10Filippo Giunchedi) [12:24:09] 6operations: puppetmaster self: Could not find dependency Class[Role::Access_new_install] for File[/usr/local/sbin/install-console] at /etc/puppet/modules/puppetmaster/manifests/scripts.pp:62 - https://phabricator.wikimedia.org/T103499#1392014 (10yuvipanda) Same as deployment-prep [12:25:07] <_joe_> YuviPanda: need puppet help? [12:25:11] <_joe_> for ^^ I mean [12:25:15] yeah [12:25:24] deployment-salt exhibits this behavior as well [12:25:38] PROBLEM - puppet last run on virt1002 is CRITICAL Puppet has 1 failures [12:26:20] _joe_: thanks :) [12:26:39] moritzm: I see the new version getting installed around now :) [12:28:40] where do I need to look for the "CRITICAL Puppet has 1 failures" error on virt1002, /var/log/puppet is empty [12:29:31] moritzm, try executing it manually. 
also look at the config [12:29:31] virt100[23] are the two hosts where I made a test update (but on 1003 puppet ran successfully a minute ago) [12:29:44] moritzm: There's a puppet.log in /var/log directly [12:30:58] RECOVERY - puppet last run on virt1002 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:31:33] <_joe_> ok YuviPanda I see the problem [12:32:04] hmm, a manual run went fine, I'll check the /var/log/puppet.log [12:33:52] moritzm: possibly dpkg clashed with you running it [12:33:54] happens fairly often [12:34:46] yeah, that coincides with the log [12:35:31] (03PS1) 10Giuseppe Lavagetto: puppetmaster::scripts: include reimaging scripts only in production [puppet] - 10https://gerrit.wikimedia.org/r/220126 (https://phabricator.wikimedia.org/T103499) [12:35:38] <_joe_> YuviPanda: served! ^^ [12:35:52] <_joe_> YuviPanda: take a look, see if it seems correct to you as well [12:37:04] (03CR) 10Yuvipanda: [C: 031] "long term, move them to a different class and include in role?" [puppet] - 10https://gerrit.wikimedia.org/r/220126 (https://phabricator.wikimedia.org/T103499) (owner: 10Giuseppe Lavagetto) [12:38:56] (03CR) 10coren: [C: 031] "WFM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [12:40:14] (03CR) 10Giuseppe Lavagetto: [C: 032] "that may work, yes." [puppet] - 10https://gerrit.wikimedia.org/r/220126 (https://phabricator.wikimedia.org/T103499) (owner: 10Giuseppe Lavagetto) [12:41:01] <_joe_> YuviPanda: merged, you should be able to recover now [12:42:44] _joe_: \o/ thanks [12:43:29] moritzm: by affects all hosts you mean prod *has* upgraded ssh? 
[12:44:15] hmm, I do see it [12:44:20] palladium has new package >_> [12:44:39] > ESC[mNotice: /Stage[main]/Ssh::Client/Package[openssh-client]/ensure: ensure changed '1:5.9p1-5ubuntu1.4' to '1:6.6p1-2ubuntu2~wmfprecise2'ESC[ [12:44:41] WTF puppet [12:44:45] yeah, and I just found the reason: [12:45:32] !log rebooting es1003 [12:45:34] besides the openssh-server definition, there's also one for openssh-client => latest (in modules/ssh/manifests/client.pp) [12:45:36] Logged the message, Master [12:45:56] moritzm: bah... [12:45:57] :( [12:46:05] so wait, upgrading that upgrades openssh-server too? [12:46:06] and when that is updated, it pulls in the reverse deps for openssh-server [12:46:38] but all is well, I logged into 15-20 machines and all fine [12:48:33] yeah [12:48:35] sorry I didn't catch that [12:48:37] I'm not really convinced of any puppet module ensuring "latest", though. for any such puppet def: if there's ever a broken update in Ubuntu or Debian it would spread across all systems automatically... [12:49:09] (any puppet module with a package definition ensuring "latest", I meant) [12:51:12] yep- 1) do not blindly update systems 2) do not blindly start systems [12:51:29] right [12:51:37] labs also has some form of unattended upgrades going, IIRC [12:54:40] !log ssh on precise hosts has been updated to a backport of 6.6p1-2ubuntu2 (the version from trusty). 
this allows us to use modern crypto (plus labs can simplify key handling) [12:54:44] Logged the message, Master [12:59:20] (03PS1) 10Jcrespo: Upgrading es1003 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220132 [13:00:27] PROBLEM - puppet last run on db2047 is CRITICAL puppet fail [13:00:37] (03PS2) 10Jcrespo: Upgrading es1003 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220132 [13:01:28] I just need a parse/sanity check +1 for ^ [13:01:43] I've just upgraded that machine [13:02:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Generally LGTM and it should work well, see my comments for a lot of small necessary improvements" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [13:02:56] aside from lint and parser validate, I mean [13:04:49] 6operations, 6Labs, 10Labs-Infrastructure, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392107 (10cscott) ``` ocg-render-admins: gid: 721 description: admins for pdf render (rt 6468) members: [cscott, s... [13:07:31] 7Blocked-on-Operations, 6operations, 7Availability, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1392116 (10chasemp) The basics of this, at least on Trusty, are simple. The logistics including ferm, ganglia, graphite, etc... [13:07:49] 6operations, 10Beta-Cluster, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392118 (10yuvipanda) [13:08:31] 6operations, 10Beta-Cluster, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392122 (10Krenair) Are you saying this relies on the OCG hosts in production? Because you're projectadmin on the deployment-prep, which should a... 
[13:12:48] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:17:43] 7Blocked-on-Operations, 6operations, 7Availability, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1392140 (10chasemp) a:5chasemp>3aaron @aaron backatcha so it gets noticed man [13:19:02] 6operations, 10Beta-Cluster, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392153 (10cscott) Well, I'll be: ``` cscott@deployment-pdf02:~$ sudo -s root@deployment-pdf02:~# ``` I guess I was already `sudo`ed to `ocg` be... [13:25:25] moritzm: so... next step is to verify that all hosts have the new package? [13:25:26] at leas tfor labs [13:25:31] *at least for labs [13:27:39] YuviPanda: yeah, for labs that seems like the proper next step. for prod there's a few hosts which hadn't had a puppet run for some days, we should sort these out first [13:27:58] moritzm: yes, and for labs there are several hosts that haven't had puppet runs in decades [13:28:00] err [13:28:02] days / months / years [13:28:20] (03CR) 10Alexandros Kosiaris: [C: 031] Upgrading es1003 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220132 (owner: 10Jcrespo) [13:29:01] (03PS3) 10Andrew Bogott: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [13:29:10] (03CR) 10Andrew Bogott: [C: 032] Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [13:30:17] (03CR) 10Andrew Bogott: [C: 032] Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [13:30:19] (03CR) 10jenkins-bot: [V: 04-1] 
Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [13:30:54] moritzm: for those instances it doesn't matter anyway either - they aren't running puppet, so won't get the new LDAPKeys code [13:31:03] 6operations, 10Beta-Cluster, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392170 (10cscott) Happiness: ``` Repo: ocg/ocg Tag: ocg/ocg-sync-20150623-132307 2/2 minions completed checkout Details: ``` I had to manually... [13:32:18] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:45] 6operations, 10Beta-Cluster, 10OCG-General-or-Unknown: salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged - https://phabricator.wikimedia.org/T103473#1392173 (10cscott) 5Open>3Resolved a:3cscott [13:33:19] (03CR) 10Jcrespo: [C: 032] Upgrading es1003 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220132 (owner: 10Jcrespo) [13:33:48] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [13:35:02] 6operations, 6Analytics-Engineering: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1392193 (10Ironholds) Ack. James, you lie! [13:41:35] (03CR) 10Alex Monk: "Weird... It rebases for me locally and Gerrit thinks it's OK. Will try again." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [13:41:58] (03PS4) 10Alex Monk: Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) [13:42:44] andrewbogott, should I deploy that? 
[13:43:19] Krenair: sure [13:43:35] (03CR) 10Alex Monk: [C: 032] Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [13:43:41] (03Merged) 10jenkins-bot: Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209744 (https://phabricator.wikimedia.org/T98567) (owner: 10Alex Monk) [13:44:17] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1392276 (10chasemp) I can pretty easily generate a file on iridium, @arielglenn what is the normal way to drop files for dumps daily? [13:44:32] worked that time.. [13:44:33] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/209744/ (duration: 00m 12s) [13:44:37] Logged the message, Master [13:44:57] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/209744/ (duration: 00m 13s) [13:45:01] Logged the message, Master [13:45:46] andrewbogott, hmm... who should be able to make people OAuth administrators? [13:46:38] Krenair: I don’t know anything [13:46:42] Me, maybe [13:46:46] Or YuviPanda [13:47:17] oh [13:47:22] apparently YuviPanda is already in the group [13:47:29] okay then [13:47:45] yeah, i was in that group earlier [13:47:47] when we had OAuth on [13:47:49] thanks Krenair [13:48:12] aude: did you finish the wikidata deploy? [13:49:08] I think hoo did it? [13:49:15] cscott: yes [13:49:17] Deskanaz, aude, hoo, greg-g: i'd like to do a quick deploy of OCG to production, if it's a good time. Yesterday's deploy during the usual window only managed to update beta. [13:49:23] ok [13:49:34] or maybe not. I see a commit author'd by hoo [13:49:38] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1392335 (10Joe) Upstart is not screwed. 
From my test: - confd starts correctly, as the process is spawned and it starts working. Thus upstart correctly reports that the process w... [13:50:26] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1392351 (10Joe) The same exact thing happens with upstart for the 99% of our services and puppet, btw. [13:50:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [13:51:40] is it too early for greg-g to respond? [13:52:34] probably, isn't it 06:52 in SF? [13:54:37] well, here goes. :) [13:57:20] !log updated OCG to version d7c698d5bf730d34057945e912ac75dc542dd788 [13:57:24] Logged the message, Master [13:58:08] moritzm: do you want to file a task about getting rid of ensure => latest from our repo? [13:59:07] yes, I'll have a look at what other packages are doing that and open a Phab task [13:59:52] moritzm: cool [14:00:42] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service - https://phabricator.wikimedia.org/T102281#1392401 (10chasemp) >>! In T102281#1391598, @hashar wrote: > Nodepool creates new images using python-diskimage-builder. Turns out that script rely on having root acce... 
[14:02:46] (03PS1) 10coren: Puppetize toolserver.org legacy server [puppet] - 10https://gerrit.wikimedia.org/r/220134 (https://phabricator.wikimedia.org/T85165) [14:03:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:04:38] !log reverted OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266 (bundler failing with exit code 8) [14:04:42] Logged the message, Master [14:04:49] you win some, you lose some [14:17:30] 6operations: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1392471 (10mforns) 3NEW [14:18:46] (03CR) 10Hoo man: [C: 04-1] rsync wikidata json dumps to labs /public/dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [14:18:52] 6operations, 6Labs: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1392478 (10Krenair) [14:22:02] why is `node-request` installed in labs (deployment-pdf01), but not in production (ocg1001)? [14:25:32] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1392535 (10ArielGlenn) So, we're talking about one file being overwritten every day? I'd just rsync it over,via cron. If you give me the full path of the file, I can set that up in puppet. We... [14:26:54] In general there is no connection between the two, so why is it strange? [14:28:00] Silly answer would be different OS release and different dependencies in packages [14:30:39] (03PS1) 10Jcrespo: Repool es1003, depool es1004 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220139 [14:31:46] 6operations, 10ops-eqiad: analytics1016 down due to power issue(?) 
- https://phabricator.wikimedia.org/T103544#1392596 (10Ottomata) 3NEW a:3Cmjohnson [14:32:27] ACKNOWLEDGEMENT - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% ottomata https://phabricator.wikimedia.org/T103544 [14:33:06] (03PS1) 10Alexandros Kosiaris: url_downloader: Increase request body/header size [puppet] - 10https://gerrit.wikimedia.org/r/220140 (https://phabricator.wikimedia.org/T97042) [14:35:04] cscott, is npm installed on ocg1001? [14:36:17] (03CR) 10Jcrespo: [C: 032] Repool es1003, depool es1004 for regular maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220139 (owner: 10Jcrespo) [14:37:09] 6operations, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1392652 (10Joe) So, from my tests, and your results, I get what follows. If we set up today an icinga nrpe check that runs every ~ 10 minutes and runs a confd --onetime --noop, and... [14:37:15] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1392653 (10Dzahn) {F182545} [14:38:02] andrewbogott: chase: this is a screenshot from icinga, there is a pop-up and a link in it. sounds like a browser issue rather than the language icinga is written in [14:38:09] https://phabricator.wikimedia.org/T101803#1392653 [14:38:55] what's wrong in this screenshot? [14:39:20] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1003, depool es1004 (duration: 00m 12s) [14:39:24] Logged the message, Master [14:39:44] nothing. it's a reply to 13:53 < andrewbogott> I wonder if the icinga people have heard of this thing called a ‘link’? [14:39:50] ah [14:39:54] 13:54 < andrewbogott> helpfully icinga shows a chat bubble when I hover over the ‘acknowledged’ graphic. 
The chat bubble is truncated by my browser window so I cannot read the text [14:40:00] mutante: that works unless the mouseover is happening at the bottom of the window [14:41:30] mutante: mostly I was confused by icinga giving me a message ID rather than the message itself. And then not having a link from the id to the message. What good is an ID? [14:41:30] andrewbogott: here's the link https://phabricator.wikimedia.org/T101803 [14:42:21] i dont see that message id and was confused how that is supposed to be related to the programming language [14:43:00] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: Increase request body/header size [puppet] - 10https://gerrit.wikimedia.org/r/220140 (https://phabricator.wikimedia.org/T97042) (owner: 10Alexandros Kosiaris) [14:43:38] andrewbogott: ooh,, and now i see that the sync issue was related to DB things? interesting! and nice that it's solved [14:43:55] mutante: it was related to the DB being huge, mostly :) [14:44:51] andrewbogott: hmm.. does that mean "without history"? [14:45:05] looking at https://gerrit.wikimedia.org/r/#/c/219827/2/modules/openstack/files/mw-xml.sh [14:45:51] the "every time a dump was loaded data was dropped but not deleted physically" does explain though, heh [14:46:29] mutante: yes, now we try to sync as little history as possible [14:46:59] the diffs for multiple edits on the same day will look ridiculous [14:47:24] but it's kind of required for this host to work, so.. [14:50:17] what does that "CRITICAL: 100.00% of data above the critical threshold" error actually mean by the way? [14:51:14] (03CR) 10Hashar: "A lot of work happened on labnodepool1001. It is now time to puppetize them in this change and in follow ups." [puppet] - 10https://gerrit.wikimedia.org/r/201728 (https://phabricator.wikimedia.org/T89143) (owner: 10Hashar) [14:51:58] mutante, yes, the config was old so mysql didn't purge deleted entries [14:52:19] andrewbogott, everithing ok yesterday in the end? 
[14:52:29] jynus: gotcha, thanks for the fix [14:52:29] did you delete the old backup? [14:53:24] jynus: yes, I’m happy with how it’s working now. And I removed the old backup. [14:54:05] if you find the same problem again, now it is easier to execute "ALTER TABLE table ENGINE=INNODB, FORCE" [14:54:22] that will defragment the table [14:54:32] akosiaris: can you merge https://gerrit.wikimedia.org/r/#/c/220095/ [14:54:33] (should not be run in a productoin, though) [14:54:57] !log reprepro: including nginx 1.9.2-1~bpo8+1 to jessie-wikimedia/backports [14:55:01] Logged the message, Master [14:56:15] or godog can you merge: https://gerrit.wikimedia.org/r/#/c/220095/ [14:58:30] kart_: looks like you're around for SWAT in a couple minutes :) [14:58:53] thcipriani: yes. [14:59:03] who in Ops can merge, https://gerrit.wikimedia.org/r/#/c/220095/ ? [14:59:15] hopefully anyone in ops could ;) [14:59:44] yes. anyone :) [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150623T1500). Please do the needful. [15:00:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220071 (https://phabricator.wikimedia.org/T101944) (owner: 10KartikMistry) [15:01:07] (03Merged) 10jenkins-bot: Enable 'frwiki-recommender' campaign in frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220071 (https://phabricator.wikimedia.org/T101944) (owner: 10KartikMistry) [15:01:49] ok. Don't know who is around :/ [15:02:05] (03PS2) 10Jcrespo: CX: Add wikis for deployment on 20150623 [puppet] - 10https://gerrit.wikimedia.org/r/220095 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:02:13] * marktraceur looks around [15:02:16] there you go :D [15:02:23] cool [15:02:27] Is nobody doing SWAT? 
[15:02:34] Christ [15:02:36] marktraceur: I'm SWATing [15:02:36] OK I'll do it [15:02:38] Oh [15:02:41] gj marktraceur [15:02:46] I was about to say [15:03:03] (03CR) 10Jcrespo: [C: 032] CX: Add wikis for deployment on 20150623 [puppet] - 10https://gerrit.wikimedia.org/r/220095 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:03:25] I saw uncertainty in the masses and assumed it was because SWAT went unclaimed [15:03:46] (03PS1) 10Andrew Bogott: Turn on autoupdate_master by default. [puppet] - 10https://gerrit.wikimedia.org/r/220147 [15:03:48] Looking for opsen gods [15:04:18] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable 'frwiki-recommender' campaign in frwiki [[gerrit:220071]] (duration: 00m 13s) [15:04:23] Logged the message, Master [15:04:24] ^ kart_ check please [15:04:29] done [15:04:46] YuviPanda: is my comment regarding hiera on https://gerrit.wikimedia.org/r/#/c/220147/ accurate, syntatically correct, etc? [15:05:32] thcipriani: it is just good, campaign will start in an hour :) [15:05:38] thcipriani: so we're okay! [15:05:50] kart_: okie doke, on to the next then [15:07:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:07:38] (03Merged) 10jenkins-bot: CX: Enable CX as default except where it is not deployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:09:07] (03CR) 10Alex Monk: "Can we please get this scheduled or something?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [15:09:18] cscott: 7am is a bit early usually :) [15:09:58] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable CX as default except where it is not deployed [[gerrit:220078]] (duration: 00m 12s) [15:10:03] Logged the message, Master [15:10:04] ^ kart_ check please [15:10:23] cscott: unless it's Mon/Thurs and I'm on the bus and it's summer time (so the sun is up and I can't sleep) ;) [15:10:37] thcipriani: okay! [15:10:40] (03CR) 10Jcrespo: Ensure apt update before sql libraries install [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/195779 (https://phabricator.wikimedia.org/T91545) (owner: 10Thcipriani) [15:12:52] thcipriani: looks good. https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:KartikMistry/Apache_Indian [15:13:23] kart_: awesome. Thanks! [15:13:51] YuviPanda, Coren, I’m looking at ldap puppetVar: realm=labs. That could be moved to hiera, yes? Or turned into a fact that looks at the domain? [15:14:23] andrewbogott: Looking at the domain seems brittle to me, but hiera sounds like a good idea. [15:15:07] so that would mean that I’d have to replace every reference to realm with a hiera lookup, right? Because we can’t be sure that realm.pp will define things early enough in the puppet run… [15:15:27] thcipriani: oh. Thanks to you! :) [15:15:30] Ah. Ew. [15:16:45] Hm. [15:16:58] I like that even less. [15:17:44] akosiaris or _joe_, any idea how to move $::realm into hiera without having to add hiera(‘realm’) in a thousand places? [15:22:19] hiera doesn't resolve superglobals ($::foo) AFAIK [15:22:39] you have to tie the data to a class parameter somehow [15:24:25] yeah, although there is realm.pp [15:25:10] I don’t really understand how realm.pp works. What’s to ensure that its vars are definied before they’re referenced? 
[15:25:40] I think the import in site.pp does that [15:25:54] realm.pp has this: [15:25:56] if $::realm == undef { [15:25:57] $realm = 'production' [15:25:58] } [15:26:10] But, what good is that $realm? It’s surely a different variable from $::realm that’s used elsewhere [15:26:18] ah. that's factor magic. [15:26:28] ? [15:26:34] which part? [15:26:43] $::realm normally being set [15:27:00] the bit in realm.pp is just for when factor isn't setup right [15:27:12] ok, hang on... [15:27:24] there are a few things you just said that I think are wrong :) [15:27:33] First, on labs at least $::realm comes from ldap [15:27:47] second — I’m pretty sure that setting $realm = ‘production [15:27:53] does not change the value of $::realm [15:27:58] so I think that line does nothing at all [15:28:30] the code is in the global scope so $::realm and $realm are the same thing there [15:28:32] (03CR) 10Filippo Giunchedi: "yep, tried to do that in https://gerrit.wikimedia.org/r/#/c/220023/ but this works as well" [puppet] - 10https://gerrit.wikimedia.org/r/220126 (https://phabricator.wikimedia.org/T103499) (owner: 10Giuseppe Lavagetto) [15:28:45] (03Abandoned) 10Filippo Giunchedi: puppetmaster: split frontend scripts [puppet] - 10https://gerrit.wikimedia.org/r/220023 (owner: 10Filippo Giunchedi) [15:29:06] bd808: hm [15:29:20] so that may or may not help with my problem, I can’t tell :) [15:29:30] Having it in a fact would be ideal, do you know if/where that fact is defined? [15:29:35] PROBLEM - mysqld processes on es1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:30:59] in ldap I think in labs. 
and exposed to factor by this -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/puppet.pp#L30 [15:31:22] https://docs.puppetlabs.com/guides/ldap_nodes.html [15:31:56] (03CR) 10Yuvipanda: [C: 04-1] "Shouldn't be merged until the script autostashes -otherwise people with uncommited changes will lose them" [puppet] - 10https://gerrit.wikimedia.org/r/220147 (owner: 10Andrew Bogott) [15:32:03] andrewbogott: but I'm reaching the wild guess stage of this [15:32:25] bd808: oh, ok. That doesn’t use factor though, it’s a different route for defining things. I was confused by the factor bit. [15:32:54] ah right. node classifiers are core puppet and not factor bolt-ons [15:33:13] (03CR) 10Andrew Bogott: "won't the rebase just fail if there are local changes? It doesn't do a --hard does it?" [puppet] - 10https://gerrit.wikimedia.org/r/220147 (owner: 10Andrew Bogott) [15:33:25] (03CR) 10Yuvipanda: "It does" [puppet] - 10https://gerrit.wikimedia.org/r/220147 (owner: 10Andrew Bogott) [15:33:41] (03CR) 10Andrew Bogott: "omg!" [puppet] - 10https://gerrit.wikimedia.org/r/220147 (owner: 10Andrew Bogott) [15:34:35] (03PS1) 10Filippo Giunchedi: puppetmaster: don't depend scripts on role::access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/220154 [15:34:46] PROBLEM - DPKG on es1004 is CRITICAL: Connection refused by host [15:35:16] PROBLEM - puppet last run on es1004 is CRITICAL: Connection refused by host [15:36:00] Coren: I’m back to thinking that fact-based-on-domain is still the best option. Can you think of a better way for an instance to introspect and know that it’s on labs? 
[15:36:32] ignore 1004, I scheduled for downtime 1003 instead of the right host, 1004 [15:37:20] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1392899 (10kevinator) [15:39:01] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1392903 (10Milimetric) a:3Milimetric [15:39:07] PROBLEM - puppet last run on mw1142 is CRITICAL Puppet has 1 failures [15:39:19] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1389620 (10Milimetric) I'm claiming this task and closing the original task that added caching. [15:40:57] _joe_ YuviPanda re: https://gerrit.wikimedia.org/r/#/c/220126/ I've also posted https://gerrit.wikimedia.org/r/#/c/220154/ since puppet was also broken on the backends [15:42:12] (03PS1) 10Faidon Liambotis: wmflib: os_version handling of Debian point-releases [puppet] - 10https://gerrit.wikimedia.org/r/220156 [15:42:16] (03PS2) 10Andrew Bogott: Turn on autoupdate_master by default. [puppet] - 10https://gerrit.wikimedia.org/r/220147 [15:42:18] (03PS1) 10Andrew Bogott: Have git-sync-upstream error out if there are local changes. [puppet] - 10https://gerrit.wikimedia.org/r/220157 [15:42:25] (03PS2) 10Filippo Giunchedi: puppetmaster: don't depend scripts on role::access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/220154 (https://phabricator.wikimedia.org/T103499) [15:42:32] ori: ^ my ruby is rusty :) [15:43:56] RECOVERY - DPKG on es1004 is OK: All packages OK [15:47:15] Coren: related… right now labs instances all include base and role::labs::instance in ldap. Any reason why base shouldn’t just automatically include role::labs::instance if realm==labs? [15:47:21] Or, alternatively, role::labs::instance include base? 
[15:47:22] (03CR) 10Jforrester: CX: Enable CX as default except where it is not deployed (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:49:13] !log rebooting es1004 [15:49:17] Logged the message, Master [15:50:47] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1392993 (10Eevans) At this point, 2.1.7 is overall looking like a regression; I suggest we downgrade to 2.1.3, and regroup after the new... [15:53:01] (03PS1) 10Andrew Bogott: Always include role::labs::instance if realm is 'labs' [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) [15:53:10] (03PS1) 10Jforrester: Follow-up 94e5fd2: Default wmgUseContentTranslation true only on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220161 [15:53:12] (03CR) 10Nemo bis: "Yes, CX is Wikipedia-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [15:55:19] deploying quick SWAT fix now shouldn't hold back next window [15:55:36] RECOVERY - puppet last run on mw1142 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:52] (03CR) 10Thcipriani: [C: 032] "SWAT fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220161 (owner: 10Jforrester) [15:55:58] (03Merged) 10jenkins-bot: Follow-up 94e5fd2: Default wmgUseContentTranslation true only on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220161 (owner: 10Jforrester) [15:56:17] Whee. 
[15:57:41] (03PS1) 10Jcrespo: Updating es1004 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220162 [15:57:42] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Follow-up 94e5fd2: Default wmgUseContentTranslation true only on Wikipedias [[gerrit:220161]] (duration: 00m 16s) [15:57:46] Logged the message, Master [15:57:47] ^ James_F [15:57:56] thcipriani: Confirmed. [15:58:06] cool, thanks! [15:58:30] (03CR) 10BBlack: [C: 04-1] "May want to keep squeeze/lenny, or fix the one ref I found instead:" [puppet] - 10https://gerrit.wikimedia.org/r/220156 (owner: 10Faidon Liambotis) [16:00:04] bd808, ori: Dear anthropoid, the time has come. Please deploy Scap HHVM restart test (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150623T1600). [16:00:29] thcipriani: are you done? I'm in no rush [16:01:07] bd808: yup, finished, thanks! [16:01:24] !log staggered upgrade of cp* fleet to nginx 1.9.2 [16:01:28] Logged the message, Master [16:04:32] (03CR) 10Jforrester: CX: Enable CX as default except where it is not deployed (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220078 (https://phabricator.wikimedia.org/T103316) (owner: 10KartikMistry) [16:06:32] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1393049 (10Milimetric) This may seem harsh, but the resolution is to use a cache buster on the end of your URL. Most of our dashboarding was built with that in mind.... [16:06:46] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1393050 (10Milimetric) 5Open>3Resolved [16:06:50] I'm going to test scap with a flag that tells it to restart HHVM. 
The test will be against a small group of servers that ori carved out listed in /etc/dsh/group/scap-test [16:07:08] (03PS1) 10Dzahn: static-bugzilla: additional redirects [puppet] - 10https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) [16:07:28] (03CR) 10Dzahn: [C: 04-1] static-bugzilla: additional redirects [puppet] - 10https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: 10Dzahn) [16:08:09] (03PS2) 10Dzahn: Wikidata build: use deep copy instead of git submodule [puppet] - 10https://gerrit.wikimedia.org/r/219814 (owner: 10JanZerebecki) [16:09:42] 6operations, 6Analytics-Kanban: Varnish caching around datasets.wikimedia.org is causing breakages - https://phabricator.wikimedia.org/T103423#1393078 (10Ironholds) It's nothing to do with my quarterly presentations. Okay, cache busting it is - should be trivial to work out. [16:10:15] bblack: heh, funny [16:10:24] ok, i'm back with ocg fun. [16:10:25] :) [16:10:25] bblack: lenny's "chromium" isn't the browser, it's another package [16:10:39] !log bd808 Started scap: no-op sync to scap-test dsh group; Testing HHVM restart [16:10:39] I think the intent was just debian-vs-lenny there [16:10:43] err debian-vs-ubuntu [16:10:43] Logged the message, Master [16:10:47] yup [16:10:49] (03PS2) 10Faidon Liambotis: wmflib: os_version handling of Debian point-releases [puppet] - 10https://gerrit.wikimedia.org/r/220156 [16:11:07] bd808, ori: you're doing the scap hhvm restart test? [16:11:20] yes, should be pretty quick [16:11:24] (03CR) 10BBlack: [C: 031] wmflib: os_version handling of Debian point-releases [puppet] - 10https://gerrit.wikimedia.org/r/220156 (owner: 10Faidon Liambotis) [16:11:59] ori: ok, could you ping me when you're done? i'd like to squeeze in an OCG deploy before the mediawiki train, if possible. [16:12:06] cscott: kk [16:12:43] getting to the good part now... 
[16:12:54] !log bd808 scap failed: AttributeError 'Scap' object has no attribute '_get_apache_list' (duration: 02m 15s) [16:12:58] Logged the message, Master [16:13:03] bah [16:13:19] where did I miss that rename? [16:14:23] (03PS1) 10BryanDavis: Fix reference to _get_apache_list [tools/scap] - 10https://gerrit.wikimedia.org/r/220166 [16:14:27] ori: ^ [16:14:28] (03PS2) 10Jcrespo: Updating es1004 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220162 [16:14:38] (03CR) 10Dzahn: "yes, but without it we also can't upgrade to jessie. it's for the bug "Mediawiki font packages: switch to Jessie"" [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [16:15:03] (03Abandoned) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [16:15:30] (03CR) 10Jcrespo: [C: 032] Updating es1004 to mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/220162 (owner: 10Jcrespo) [16:16:09] (03CR) 10BryanDavis: [C: 032] "Trivial fix to a method rename." (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/220166 (owner: 10BryanDavis) [16:16:30] (03Merged) 10jenkins-bot: Fix reference to _get_apache_list [tools/scap] - 10https://gerrit.wikimedia.org/r/220166 (owner: 10BryanDavis) [16:16:38] bd808: looking [16:17:27] (03Abandoned) 10Matanya: statsdlb: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/219187 (owner: 10Matanya) [16:17:56] (03Restored) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [16:18:04] 6operations, 7discovery-system: Install etcd in multiple rows/racks - https://phabricator.wikimedia.org/T101713#1393113 (10RobH) [16:19:01] ori: I merged already. 
it was trivial from the manual rebase I did to the parent patch [16:19:25] updating beta cluster now [16:19:51] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1393116 (10Dzahn) Can we rename your Gerrit user instead? It would require some steps but is documented on https://wikitech.wikimedia.org/wiki/Renaming_users [16:21:04] bd808: thanks [16:21:43] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1393120 (10Krenair) Would have to get in the queue first: T85913 [16:21:46] * bd808 now has to use trebuchet in prod again and thus crosses fingers [16:22:05] 482/482 minions completed fetch [16:22:12] that's awesome [16:22:33] 482/482 minions completed checkout [16:22:43] RECOVERY - mysqld processes on es1004 is OK: PROCS OK: 1 process with command name mysqld [16:22:55] 6operations, 10ops-codfw: cp2024 console + disk issues - https://phabricator.wikimedia.org/T103090#1393126 (10Papaul) 5Open>3Resolved @Bblack Disk replacement complete, OS installation complete. [16:22:57] !log updated scap to 947b93f (Fix reference to _get_apache_list) [16:23:01] Logged the message, Master [16:23:08] ok. ori are you ready to try again? [16:23:45] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1393133 (10RobH) I'll add that this isn't just Daniel's suggestion; this was discussed during the operations meeting and having a personal user account of a staff... 
[16:25:19] !log bd808 Started scap: no-op sync to scap-test dsh group; Testing HHVM restart take 2 [16:25:24] Logged the message, Master [16:25:33] bd808: yeah [16:26:45] !log bd808 Finished scap: no-op sync to scap-test dsh group; Testing HHVM restart take 2 (duration: 01m 26s) [16:26:49] Logged the message, Master [16:26:55] boo [16:26:59] "16:26:45 scap-hhvm-restart failed: an integer is required" [16:27:01] what happened? [16:27:30] bd808: psutil.pid_exists(hhvm_pid) [16:27:30] psutil needs the string cast to an int apparently [16:27:35] yeah [16:27:52] * bd808 has spent too much time in php land [16:28:20] submitting a patch or should i? [16:28:28] I'm on it [16:29:38] (03PS3) 10Faidon Liambotis: wmflib: os_version handling of Debian point-releases [puppet] - 10https://gerrit.wikimedia.org/r/220156 [16:29:53] (03CR) 10Faidon Liambotis: [C: 032 V: 032] wmflib: os_version handling of Debian point-releases [puppet] - 10https://gerrit.wikimedia.org/r/220156 (owner: 10Faidon Liambotis) [16:30:02] i got paravoid to write ruby [16:30:03] success [16:30:07] (03PS1) 10BryanDavis: Cast pid read from file to an int [tools/scap] - 10https://gerrit.wikimedia.org/r/220168 [16:30:09] I've done it before [16:30:35] I have a bunch of commits to os_version [16:30:40] all trivial, but this one is too :) [16:30:46] andrewbogott: I'd have gone the other way 'round, (role::labs::instance include base) to simplify - I know that /right now/ everything in realm labs is an instance, but there is no requirement that it be the case in general. [16:31:01] ori: review https://gerrit.wikimedia.org/r/#/c/220168/1 please? [16:31:12] Coren: sure, ok. [16:31:27] (03CR) 10Ori.livneh: [C: 032] Cast pid read from file to an int [tools/scap] - 10https://gerrit.wikimedia.org/r/220168 (owner: 10BryanDavis) [16:31:39] YuviPanda: It may actually be possible to upgrade the master/shadow straight to trusty. I'm trying to make it break now, but it looks promising. 
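Note on the take 2 failure above ("an integer is required"): a PID read from a file arrives as a string and has to be cast with int() before a check such as psutil.pid_exists(). The following is a minimal sketch of that cast, not the scap code. The helper names read_pid() and pid_exists() are hypothetical. Because psutil is a third-party package, os.kill(pid, 0) is swapped in as a POSIX-only stand-in for psutil.pid_exists().

```python
import os


def read_pid(pid_file):
    """Return the PID stored in pid_file as an int (hypothetical helper).

    File contents are text, so the value must be cast before it is
    handed to anything that expects an integer PID.
    """
    with open(pid_file) as f:
        return int(f.read().strip())


def pid_exists(pid):
    """POSIX-only stand-in for psutil.pid_exists(), using signal 0."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

Casting once at the point where the file is read keeps every later caller working with an int.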
[16:31:46] hoi [16:31:47] (03Merged) 10jenkins-bot: Cast pid read from file to an int [tools/scap] - 10https://gerrit.wikimedia.org/r/220168 (owner: 10BryanDavis) [16:31:56] Coren: rebuild or upgrade in place? [16:32:10] is someone around who I can ask a security related question to oAuth ? [16:32:13] YuviPanda: Rebuild - I wouldn't trust a dist-upgrade with our images. [16:32:17] hm, I need to restart the puppetmaster now, right [16:32:18] +1 [16:32:37] paravoid: for the os_version changes? [16:32:39] yeah [16:32:43] GerardM-: moritzm perhaps [16:32:53] paravoid: I'm not sure - when I made the ipresolve changes I don't remember restarting them [16:33:13] YuviPanda: I'm mostly worried about the config and what happens if one is trusty and the other precise, but it looks like they didn't mess with the actual config between versions (which would make sense as you'd /want/ to be able to upgrade a grid in stages) [16:33:16] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1393175 (10RobH) [16:33:18] if they do need restarting, I'm curious how the ipresolv changes looks ok. [16:33:35] (03PS2) 10Andrew Bogott: Always include base in role::labs::instance [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) [16:33:40] some stupid thing that we've encountered before [16:33:47] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#964633 (10RobH) [16:33:49] YuviPanda: We know that clients play nice between versions. [16:34:08] Coren: right. 
[16:34:29] uhoh [16:35:02] PROBLEM - puppet last run on mw2186 is CRITICAL puppet fail [16:35:03] PROBLEM - puppet last run on lvs2003 is CRITICAL puppet fail [16:35:09] !log updated scap to da64a65 (Cast pid read from file to an int) [16:35:13] Logged the message, Master [16:35:21] (03PS1) 10Faidon Liambotis: wmflib: brown paper bag fix for os_version [puppet] - 10https://gerrit.wikimedia.org/r/220170 [16:35:25] (03CR) 10jenkins-bot: [V: 04-1] wmflib: brown paper bag fix for os_version [puppet] - 10https://gerrit.wikimedia.org/r/220170 (owner: 10Faidon Liambotis) [16:35:30] (03PS2) 10Faidon Liambotis: wmflib: brown paper bag fix for os_version [puppet] - 10https://gerrit.wikimedia.org/r/220170 [16:35:42] does anyone know where https://commons.wikimedia.org/skins-1.5/monobook/bullet.gif has been movd? [16:35:44] ori: third time's a charm [16:35:49] !log bd808 Started scap: no-op sync to scap-test dsh group; Testing HHVM restart take 3 [16:35:52] Logged the message, Master [16:36:01] (03CR) 10Faidon Liambotis: [C: 032 V: 032] wmflib: brown paper bag fix for os_version [puppet] - 10https://gerrit.wikimedia.org/r/220170 (owner: 10Faidon Liambotis) [16:37:01] !log bd808 Finished scap: no-op sync to scap-test dsh group; Testing HHVM restart take 3 (duration: 01m 12s) [16:37:06] Logged the message, Master [16:37:18] "16:37:01 scap-hhvm-restart failed: [Errno 2] No such file or directory" [16:37:19] ori: I'll kill you [16:37:32] ? [16:37:52] what did i do? 
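Note on the take 3 failure above ("[Errno 2] No such file or directory"): on POSIX systems this is the error Python's subprocess module raises when a whole command line is passed as one string without a shell, because the entire string is then treated as the name of the executable. A minimal sketch of the difference, assuming a harmless `echo` command as a placeholder (the actual command scap runs is not shown in this log):

```python
import errno
import subprocess

# Placeholder command line; not the command scap actually runs.
COMMAND = "echo restart-test"


def run_as_single_string(command):
    """Run command as one string with no shell; return the errno on failure."""
    try:
        subprocess.check_call(command, stdout=subprocess.DEVNULL)
    except OSError as exc:
        # With no shell, "echo restart-test" is looked up as a single
        # program name, which does not exist.
        return exc.errno
    return 0


def run_with_shell(command):
    """Let /bin/sh split the string into program and arguments."""
    return subprocess.check_call(command, shell=True, stdout=subprocess.DEVNULL)


def run_as_list(argv):
    """Pass program and arguments as a list; no shell is involved."""
    return subprocess.check_call(argv, stdout=subprocess.DEVNULL)
```

On a POSIX host, run_as_single_string(COMMAND) is expected to return errno.ENOENT, while the other two calls succeed. The list form is a possible alternative to shell=True when the command needs no shell features, since it avoids shell parsing of the arguments.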
[16:37:53] (03PS1) 10Faidon Liambotis: wmflib: brown paper bag fix for os_version² [puppet] - 10https://gerrit.wikimedia.org/r/220171 [16:37:54] you jinxed it [16:37:57] (03CR) 10jenkins-bot: [V: 04-1] wmflib: brown paper bag fix for os_version² [puppet] - 10https://gerrit.wikimedia.org/r/220171 (owner: 10Faidon Liambotis) [16:38:03] (03PS2) 10Faidon Liambotis: wmflib: brown paper bag fix for os_version² [puppet] - 10https://gerrit.wikimedia.org/r/220171 [16:38:07] haha :) [16:38:24] (03CR) 10Faidon Liambotis: [C: 032 V: 032] wmflib: brown paper bag fix for os_version² [puppet] - 10https://gerrit.wikimedia.org/r/220171 (owner: 10Faidon Liambotis) [16:38:44] forever in history [16:39:11] bd808: [16:39:16] (03CR) 10Yuvipanda: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/220171 (owner: 10Faidon Liambotis) [16:39:20] /srv/deployment/scap/scap/scap/main.py:512 [16:39:23] should specify shell=true [16:39:36] godog: strontium's puppet is broken [16:39:43] ori: doh [16:39:57] all of them should [16:40:25] anything I can do to help? [16:40:32] bd808: should we let cscott go ahead? [16:40:39] (03PS1) 10Yuvipanda: tools: Install python3-scipy [puppet] - 10https://gerrit.wikimedia.org/r/220172 (https://phabricator.wikimedia.org/T103136) [16:40:51] ori: yeah. 
cscott all yours [16:40:55] w00t [16:40:57] (03PS2) 10Yuvipanda: tools: Install python3-scipy [puppet] - 10https://gerrit.wikimedia.org/r/220172 (https://phabricator.wikimedia.org/T103136) [16:41:05] hopefully this time's the charm [16:41:06] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Install python3-scipy [puppet] - 10https://gerrit.wikimedia.org/r/220172 (https://phabricator.wikimedia.org/T103136) (owner: 10Yuvipanda) [16:41:55] (03PS3) 10Faidon Liambotis: puppetmaster: don't depend scripts on role::access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/220154 (https://phabricator.wikimedia.org/T103499) (owner: 10Filippo Giunchedi) [16:42:03] (03CR) 10Faidon Liambotis: [C: 032 V: 032] puppetmaster: don't depend scripts on role::access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/220154 (https://phabricator.wikimedia.org/T103499) (owner: 10Filippo Giunchedi) [16:42:06] (03PS1) 10Yuvipanda: tools: Install python3-scipy only in Trusty [puppet] - 10https://gerrit.wikimedia.org/r/220173 [16:42:16] (03PS2) 10Yuvipanda: tools: Install python3-scipy only in Trusty [puppet] - 10https://gerrit.wikimedia.org/r/220173 [16:42:23] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Install python3-scipy only in Trusty [puppet] - 10https://gerrit.wikimedia.org/r/220173 (owner: 10Yuvipanda) [16:43:53] (03PS12) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [16:44:05] (03PS1) 10Faidon Liambotis: localssl: add reuseport to nginx listen directive [puppet] - 10https://gerrit.wikimedia.org/r/220174 [16:44:07] (03PS1) 10BryanDavis: Add shell=True to subprocess.check_call() calls [tools/scap] - 10https://gerrit.wikimedia.org/r/220175 [16:44:12] (03PS3) 10Andrew Bogott: Always include base in role::labs::instance [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) [16:44:25] (03CR) 10Faidon Liambotis: [C: 04-1] 
"Untested." [puppet] - 10https://gerrit.wikimedia.org/r/220174 (owner: 10Faidon Liambotis) [16:45:02] !log updated OCG to version db7a56965233a74c73917c78b5c8c84c867321d9 [16:45:07] Logged the message, Master [16:46:07] (03PS4) 10Andrew Bogott: Always include base in role::labs::instance [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) [16:47:52] looks good, whoo. [16:48:54] (03CR) 10Ori.livneh: [C: 032] Add shell=True to subprocess.check_call() calls [tools/scap] - 10https://gerrit.wikimedia.org/r/220175 (owner: 10BryanDavis) [16:49:17] (03Merged) 10jenkins-bot: Add shell=True to subprocess.check_call() calls [tools/scap] - 10https://gerrit.wikimedia.org/r/220175 (owner: 10BryanDavis) [16:50:33] bd808: updated scap [16:51:08] (03CR) 10Legoktm: "Why -1?" [puppet] - 10https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: 10Dzahn) [16:51:10] ori: one more try? [16:52:02] !log bd808 Started scap: no-op sync to scap-test dsh group; Testing HHVM restart take 4 [16:52:08] Logged the message, Master [16:52:27] bd808: CalledProcessError: Command 'sudo -n -- /sbin/start apache2' returned non-zero exit status 1 [16:52:35] /sbin/start doesn't know about the apache2 name [16:52:38] you need to use service [16:53:28] is /sbin/start only for upstart things? [16:53:32] !log bd808 Finished scap: no-op sync to scap-test dsh group; Testing HHVM restart take 4 (duration: 01m 30s) [16:53:37] Logged the message, Master [16:54:20] (03PS1) 10Ori.livneh: Use service instead of start to start apache2 [tools/scap] - 10https://gerrit.wikimedia.org/r/220177 [16:54:23] (03CR) 10jenkins-bot: [V: 04-1] Use service instead of start to start apache2 [tools/scap] - 10https://gerrit.wikimedia.org/r/220177 (owner: 10Ori.livneh) [16:55:33] (03PS1) 10Andrew Bogott: Rip out the labsstatus report. 
[puppet] - 10https://gerrit.wikimedia.org/r/220178 [16:56:28] (03PS1) 10Ori.livneh: Update for Id19ed540bee9f4 [puppet] - 10https://gerrit.wikimedia.org/r/220179 [16:57:17] (03PS2) 10Faidon Liambotis: localssl: add reuseport to nginx listen directive [puppet] - 10https://gerrit.wikimedia.org/r/220174 [16:57:28] (03CR) 10Yuvipanda: "Already in https://gerrit.wikimedia.org/r/#/c/217838/" [puppet] - 10https://gerrit.wikimedia.org/r/220178 (owner: 10Andrew Bogott) [16:57:56] (03PS2) 10Ori.livneh: Update for Id19ed540bee9f4 [puppet] - 10https://gerrit.wikimedia.org/r/220179 [16:58:12] (03CR) 10BBlack: [C: 031] localssl: add reuseport to nginx listen directive [puppet] - 10https://gerrit.wikimedia.org/r/220174 (owner: 10Faidon Liambotis) [16:58:33] (03PS2) 10Ori.livneh: Use service instead of start to start apache2 [tools/scap] - 10https://gerrit.wikimedia.org/r/220177 [16:58:38] ^ bd808 [16:58:54] (03CR) 10Ori.livneh: [C: 032] Update for Id19ed540bee9f4 [puppet] - 10https://gerrit.wikimedia.org/r/220179 (owner: 10Ori.livneh) [16:59:47] (03Abandoned) 10Andrew Bogott: Rip out the labsstatus report. 
[puppet] - 10https://gerrit.wikimedia.org/r/220178 (owner: 10Andrew Bogott) [16:59:52] (03PS3) 10Andrew Bogott: labs: Disable wikitech puppet status reporter [puppet] - 10https://gerrit.wikimedia.org/r/217838 (owner: 10Yuvipanda) [17:00:17] ori: we need to fix the sudoers rules too [17:00:29] bd808: already done, applying puppet on the hosts [17:00:50] (03CR) 10BryanDavis: [C: 032] Use service instead of start to start apache2 [tools/scap] - 10https://gerrit.wikimedia.org/r/220177 (owner: 10Ori.livneh) [17:01:06] (03CR) 10Andrew Bogott: [C: 032] labs: Disable wikitech puppet status reporter [puppet] - 10https://gerrit.wikimedia.org/r/217838 (owner: 10Yuvipanda) [17:01:13] (03Merged) 10jenkins-bot: Use service instead of start to start apache2 [tools/scap] - 10https://gerrit.wikimedia.org/r/220177 (owner: 10Ori.livneh) [17:01:20] bd808: don't try again yet [17:01:35] I've got a call to be on [17:01:42] np [17:01:57] bd808: what's the invocation you're using? [17:01:59] i'll give it a shot [17:02:03] (unless you prefer i wait) [17:02:13] scap --verbose -D dsh_targets:scap-test --restart [17:02:56] don't step on twentyafterfour though [17:03:10] * ori nods [17:03:13] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:13] RECOVERY - puppet last run on tmh1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:13] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:03:21] (03PS1) 10Filippo Giunchedi: puppetmaster: don't depend scripts on role::access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/220180 (https://phabricator.wikimedia.org/T103499) [17:03:22] RECOVERY - puppet last run on mw2137 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:23] RECOVERY - puppet last run on ms-be2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:23] 
RECOVERY - puppet last run on zirconium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:23] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:03:24] RECOVERY - puppet last run on mw2117 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:24] RECOVERY - puppet last run on mw2156 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:32] paravoid: yep, there's been two/three code reviews between yesterday and now, https://gerrit.wikimedia.org/r/220180 should fix it [17:03:32] RECOVERY - puppet last run on mw2128 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:32] RECOVERY - puppet last run on mw1037 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:03:33] RECOVERY - puppet last run on mw1103 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:03:43] RECOVERY - puppet last run on mw2078 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:03:53] RECOVERY - puppet last run on mw1199 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:53] RECOVERY - puppet last run on mw2195 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:53] RECOVERY - puppet last run on mw2120 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:03:54] RECOVERY - puppet last run on mw2176 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:03:54] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:12] godog: I merged another one just a few minutes ago [17:04:39] !log ori Started scap: (no message) [17:04:43] Logged the message, Master [17:05:29] paravoid: ah! nevermind then, thanks! 
[17:06:02] !log ori scap aborted: (no message) (duration: 01m 23s) [17:06:07] Logged the message, Master [17:06:17] godog: if things are broken, you shouldn't wait for a code review [17:07:52] paravoid: *nod* [17:08:33] twentyafterfour: OK with you if I squeeze another quick scap test? (<5 mins) [17:09:04] 7Puppet, 6Labs: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1393289 (10Andrew) 5Open>3Resolved moot due to 220ffc9cef589efa0cde2defdc1e57a5fbf853e2 [17:09:16] * ori goes for it, should be quick [17:09:18] !log ori Started scap: (no message) [17:09:32] ori: go for it [17:10:35] andrewbogott: we'd need https://gerrit.wikimedia.org/r/#/c/217839/ as well I think [17:10:52] !log ori Finished scap: (no message) (duration: 01m 34s) [17:10:56] Logged the message, Master [17:11:13] YuviPanda: I am the king of duplicating your work today https://gerrit.wikimedia.org/r/#/c/220165/ [17:11:20] andrewbogott: haha :) [17:12:59] YuviPanda: I’ll compare/contrast those two patches… getting lunch first though [17:13:08] andrewbogott: cool :) I also made https://gerrit.wikimedia.org/r/#/c/220181/ today [17:13:13] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1393301 (10fgiunchedi) agreed, 2.1.3 will give us the metrics back (and/or for longer at least) [17:13:46] (03PS1) 10BBlack: Add Filipe da Silva's multicert patches, forward-ported [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220182 (https://phabricator.wikimedia.org/T86654) [17:14:33] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1393306 (10GWicke) +1 from me as well. 
[17:14:44] !log ori Started scap: (no message) [17:15:43] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1393314 (10cscott) But `git deploy service restart` worked fine when I was doing my OCG deploys today. So it's not totally b... [17:16:13] (03CR) 10BBlack: [C: 032 V: 032] Add Filipe da Silva's multicert patches, forward-ported [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220182 (https://phabricator.wikimedia.org/T86654) (owner: 10BBlack) [17:16:26] !log ori Finished scap: (no message) (duration: 01m 42s) [17:16:31] Logged the message, Master [17:16:32] bblack: \o/ [17:16:52] bd808: [17:16:54] 17:15:51 Started restart_hhvm [17:16:54] restart_hhvm: 100% (ok: 12; fail: 0; left: 0) [17:16:54] 17:16:26 Finished restart_hhvm (duration: 00m 35s) [17:16:56] 17:16:26 Finished scap: (no message) (duration: 01m 42s) [17:17:02] thank you, you wonderful person you [17:18:38] twentyafterfour: all yours [17:21:35] (03CR) 10Tim Landscheidt: [C: 031] "I tested this with:" [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [17:22:05] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [17:22:56] (03CR) 10coren: "Indeed, part of the rationale for using rename() is that NFS implements POSIX file semantics strictly and minimally - rename() is one of t" [puppet] - 10https://gerrit.wikimedia.org/r/218880 (https://phabricator.wikimedia.org/T92561) (owner: 10coren) [17:24:05] ori: so it finally worked? 
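Editorial note: the earlier failure in this log (`CalledProcessError: Command 'sudo -n -- /sbin/start apache2' returned non-zero exit status 1`) is how Python's `subprocess.check_call()` reports a command that exits non-zero. The sketch below is not scap's code; `run_checked` is a hypothetical helper, and the two commands are harmless stand-ins for the real service invocations, which the log does not show in full.

```python
import subprocess
import sys


def run_checked(argv):
    """Run argv; return (True, 0) on success or (False, exit_status) on failure."""
    try:
        subprocess.check_call(argv)
        return True, 0
    except subprocess.CalledProcessError as exc:
        # check_call raises when the command exits non-zero; the status is
        # available as exc.returncode.
        return False, exc.returncode


# Stand-ins for a failing and a succeeding service command (assumptions,
# chosen so the example runs anywhere Python is installed).
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "pass"]

print(run_checked(failing))
print(run_checked(passing))
```

Passing the command as a list avoids shell parsing; with `shell=True` the command would instead be given as a single string.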
[17:27:54] (03PS3) 10Faidon Liambotis: localssl: add reuseport to nginx listen directive [puppet] - 10https://gerrit.wikimedia.org/r/220174 [17:28:08] 7Puppet, 6Labs: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1393354 (10scfc) 5Resolved>3Invalid (And here as well: The assignee didn't fix it, so it's not resolved :-).) [17:28:44] (03CR) 10Faidon Liambotis: [C: 032] localssl: add reuseport to nginx listen directive [puppet] - 10https://gerrit.wikimedia.org/r/220174 (owner: 10Faidon Liambotis) [17:30:55] (03PS1) 10Faidon Liambotis: localssl: fix spacing with reuseport [puppet] - 10https://gerrit.wikimedia.org/r/220184 [17:31:00] (03CR) 10jenkins-bot: [V: 04-1] localssl: fix spacing with reuseport [puppet] - 10https://gerrit.wikimedia.org/r/220184 (owner: 10Faidon Liambotis) [17:31:07] (03PS2) 10Faidon Liambotis: localssl: fix spacing with reuseport [puppet] - 10https://gerrit.wikimedia.org/r/220184 [17:31:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] localssl: fix spacing with reuseport [puppet] - 10https://gerrit.wikimedia.org/r/220184 (owner: 10Faidon Liambotis) [17:42:30] (03PS1) 10Faidon Liambotis: wmflib/os_version: strip .* from self_release [puppet] - 10https://gerrit.wikimedia.org/r/220187 [17:42:39] last commit for the day, clearly not my day [17:42:44] (03PS2) 10Faidon Liambotis: wmflib/os_version: strip .* from self_release [puppet] - 10https://gerrit.wikimedia.org/r/220187 [17:43:01] (03CR) 10Faidon Liambotis: [C: 032 V: 032] wmflib/os_version: strip .* from self_release [puppet] - 10https://gerrit.wikimedia.org/r/220187 (owner: 10Faidon Liambotis) [17:45:04] bd808: yes [17:45:34] brilliant [17:45:41] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review, 7Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1393448 (10fgiunchedi) hah I was actually wrong, 1.28 is in debian experimental, https://packages.debian.org/source/experimental/ipvsadm see als... 
[17:46:18] (03PS1) 10Catrope: Add Flow_test_talk namespace to en beta too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220188 [17:47:23] bd808: so, for the next step, how about we repool the servers in the scap-test group, so we see how this works with Pybal? [17:47:57] ori: that seems reasonable to me [17:52:07] (03PS1) 10BBlack: bump libssl-dev req to 1.0.2+ [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220189 [17:55:48] (03PS3) 10Filippo Giunchedi: racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) [17:57:06] !log repooled scap-test servers (mw1170-mw1175 and mw1270-mw1275) [17:57:11] Logged the message, Master [17:59:44] (03PS13) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150623T1800). Please do the needful. [18:02:06] twentyafterfour: can i do one last <2m scap? 
[18:02:12] to the test group [18:02:19] should literally take two minutes [18:02:45] (03PS4) 10Filippo Giunchedi: racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) [18:03:01] (03CR) 10BBlack: [C: 032 V: 032] bump libssl-dev req to 1.0.2+ [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220189 (owner: 10BBlack) [18:03:31] (03PS5) 10Filippo Giunchedi: racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) [18:03:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] racktables: increase default php memory limit [puppet] - 10https://gerrit.wikimedia.org/r/217724 (https://phabricator.wikimedia.org/T102092) (owner: 10Filippo Giunchedi) [18:04:49] (03PS14) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [18:06:23] (03CR) 10Quiddity: [C: 031] "LGTM, but does anything need to be done to remove the changes in the other patchset https://gerrit.wikimedia.org/r/#/c/172486/ - or can th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220121 (https://phabricator.wikimedia.org/T103283) (owner: 10Prtksxna) [18:08:46] !log ori Started scap: (no message) [18:11:45] !log reloading nginx on all cp* for reuseport [18:11:49] Logged the message, Master [18:11:53] weee [18:12:33] paravoid: there may also be a perf hit due to session cache loss, right? [18:12:42] no, why? 
[18:12:55] I don't think the session cache gets purged on reload [18:13:20] !log ori Finished scap: (no message) (duration: 04m 34s) [18:13:21] but the 1.9.2 upgrade I did before yes, had that effect [18:13:25] Logged the message, Master [18:14:00] twentyafterfour: {{done}} [18:18:33] (reload is done) [18:19:33] woot [18:20:07] and yes, the session cache usually survives reload, but not restart, although I'm not sure about the special case of USR2-based restart [18:20:47] are we going to get rid of the https=1 X-Analytics header? [18:21:00] it's still useful, kinda [18:21:06] they're logging the 301-over-HTTP hits too [18:21:22] I guess. [18:21:24] I think? [18:21:32] yeah I don't see why it wouldn't be logged [18:21:40] unless we collectively decide to do those in nginx [18:22:00] well eventually we will, but I'd like to solve the DNS issues first [18:22:12] (such that we don't have a bunch of wiki-redirect domains that aren't SSL-valid) [18:22:15] if possible [18:22:30] yeah :/ [18:22:41] on another note, we should fix rcstream's https too [18:22:55] and default-redirect that to https as well [18:23:10] ori *cough* [18:23:26] I think the current http-over-443 thing had to be just a mistake when it was first configured, that nobody noticed/cared. [18:24:06] i'm not sure [18:24:18] unless that's some godawful intentional thing required for websockets somehow [18:25:04] (that they want to speak HTTP first over port 443, switch to websocket, then encrypt? 
or something crazy like that) [18:27:06] !log twentyafterfour Started scap: New deployment branch: 1.26wmf11 [18:27:10] Logged the message, Master [18:28:08] (03PS5) 10Rush: Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [18:28:13] (03PS1) 10Filippo Giunchedi: racktables: notify apache2 Service [puppet] - 10https://gerrit.wikimedia.org/r/220209 (https://phabricator.wikimedia.org/T102092) [18:28:13] I'll fix it [18:28:24] yeah, it's just a mistake [18:28:25] (03CR) 10Rush: Setup a node pool file from etcd for lvs cluster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [18:28:33] i looked over the ws docs, no mention of any quirk like that [18:28:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] racktables: notify apache2 Service [puppet] - 10https://gerrit.wikimedia.org/r/220209 (https://phabricator.wikimedia.org/T102092) (owner: 10Filippo Giunchedi) [18:30:00] (03PS1) 10Merlijn van Deen: [ssh, WIP] allow login from tools-login [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) [18:30:04] YuviPanda: ^ [18:30:53] (03CR) 10jenkins-bot: [V: 04-1] [ssh, WIP] allow login from tools-login [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) (owner: 10Merlijn van Deen) [18:31:19] (03PS1) 10Faidon Liambotis: rcstream: fix TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/220222 [18:31:42] !log start rolling-downgrade of cassandra to 2.1.3 T102015 [18:31:47] Logged the message, Master [18:32:27] ori, bblack ^ [18:34:11] (03PS6) 10Rush: Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [18:34:25] (03CR) 10Ori.livneh: [C: 04-1] "Looks good, but I need to give the docs and sample code a once-over to make sure all the code uses wss://." 
[puppet] - 10https://gerrit.wikimedia.org/r/220222 (owner: 10Faidon Liambotis) [18:34:35] (03CR) 10Faidon Liambotis: "Yes, you should make this conditional then." [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [18:34:42] ori: I haven't made this mandatory yet [18:34:54] ori: i.e. no 302/301, no HSTS [18:35:04] this just fixes HTTPS which is broken atm [18:35:18] oh, ok [18:35:25] (03PS7) 10Rush: Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [18:35:30] (03PS2) 10Andrew Bogott: Have git-sync-upstream error out if there are local changes. [puppet] - 10https://gerrit.wikimedia.org/r/220157 [18:35:38] (03CR) 10Ori.livneh: [C: 031] " ori: I haven't made this mandatory yet ori: i.e. no 302/301, no HSTS" [puppet] - 10https://gerrit.wikimedia.org/r/220222 (owner: 10Faidon Liambotis) [18:35:43] (03PS3) 10Andrew Bogott: Turn on autoupdate_master by default. [puppet] - 10https://gerrit.wikimedia.org/r/220147 [18:36:13] (03PS5) 10Andrew Bogott: Always include base in role::labs::instance [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) [18:36:46] (03CR) 10Andrew Bogott: [C: 032] Always include base in role::labs::instance [puppet] - 10https://gerrit.wikimedia.org/r/220160 (https://phabricator.wikimedia.org/T103357) (owner: 10Andrew Bogott) [18:39:39] (03CR) 10Yuvipanda: "Wheeee." [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) (owner: 10Merlijn van Deen) [18:42:12] YuviPanda: I'm actually thinking of hard-coding the key, and then providing role::ssh::hostbased_from_tools and role::ssh::hostbased_from_bastion roles [18:42:34] bikeshed later, and hba first [18:42:45] err [18:42:58] I meant, let's get it done first and then bikeshed :) [18:43:26] also, no, it really is tools-bastion-01.tools.eqiad.wmflabs [18:43:34] errr [18:43:36] why? 
[18:43:44] that's what sshd says the host is called [18:43:53] you're sshing to tools-bastion-01? [18:43:57] also nslookup 10.68.17.228 [18:43:58] and not to bastion-01.bastion.eqiad.wmflabs [18:44:05] from [18:44:07] bastion-01.bastion.eqiad.wmflabs is what you want [18:44:15] ooooh [18:44:16] yeah [18:44:19] because every time someone gets added to any project they get added to the bastion project as well [18:44:24] and have access there [18:44:27] no, you're right. [18:44:31] I misread what you meant [18:44:39] right [18:44:51] but, as noted, that means bastion needs keysign enabled, and I need to build that ssh config [18:44:56] yeah [18:45:00] WIP etc [18:45:06] I should harden bastion-restricted, though. [18:45:19] that's Ops only bastion, should have agent forwarding and stuff disabled... [18:45:28] and match prod. [18:45:44] I was planning to wait for a response from moritz [18:45:45] shouldn't affect anyone outside the ops team tho [18:45:45] yeah [18:45:50] valhallasw: yeah, that's a good idea too [18:46:38] anyway, gotta go now. [18:46:38] bye [18:46:43] see ya [18:48:33] ok scap still broken? [18:48:56] rsync: rename failed for "/srv/mediawiki/php-1.26wmf10/resources/src/mediawiki/mediawiki.Title.js" (from php-1.26wmf10/resources/src/mediawiki/.~tmp~/mediawiki.Title.js): No such file or directory (2) [18:49:23] twentyafterfour: broken from which time? (seriously) afaik bd808 and godog left it in a good state last night [18:49:27] dunno how long it's been broken tho [18:49:43] that's an rsync hiccup [18:49:44] I don't know, lots of errors [18:49:49] it should not be broken [18:49:52] specific hosts? 
[18:50:30] bd808: lots of hosts and lots of files so I'm still not sure of a pattern [18:53:44] !log twentyafterfour Finished scap: New deployment branch: 1.26wmf11 (duration: 26m 37s) [18:53:48] Logged the message, Master [18:53:54] twentyafterfour: looks like the commonality is that the source for the rsyncs that broke was mw2187.codfw.wmnet [18:54:31] hnmm [18:54:31] and mw2080.codfw.wmnet as well [18:54:34] https://phabricator.wikimedia.org/P828 [18:55:37] and from mw2001.codfw.wmnet [18:56:45] looks like all the files were resources/* [18:57:16] there are a couple of eqiad hosts in there too. That error is rsync gagging in the post-sync rename phase [18:57:29] yes [18:57:46] should I try it again to see if it succeeds on try #2? [18:58:09] It wouldn't hurt [18:58:26] but also the definition of insanity ;) [18:58:35] !log twentyafterfour Started scap: New deployment branch: 1.26wmf11 try #2 (13 apaches failed) [18:58:39] Logged the message, Master [18:59:32] $scap === $insanity ? retry() : fail() [19:00:40] goto: $retry() [19:02:25] !log twentyafterfour Finished scap: New deployment branch: 1.26wmf11 try #2 (13 apaches failed) (duration: 03m 50s) [19:02:29] Logged the message, Master [19:02:48] I didn't see any errors scroll by that time [19:04:06] mw2187.codfw.wmnet has twice as many clients as the rest of the rsync proxy pool [19:05:00] Is it in a row where we don't have other proxies and have a lot of mw servers? 
[19:07:03] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1393760 (10fgiunchedi) I've downgraded restbase100[1-6] back to 2.1.3 [19:11:47] !log running apache graceful-stop on mw1042 to test mod_status behavior during graceful stop [19:11:51] Logged the message, Master [19:14:38] (03PS1) 10BryanDavis: Use utils.sudo_check_call instead of subprocess.check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/220240 [19:14:41] (03PS1) 10BryanDavis: Set --restart batch size to 5% of total hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/220241 [19:18:46] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail [19:20:18] 6operations, 5Patch-For-Review: racktables object field is not working - https://phabricator.wikimedia.org/T102092#1393826 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi fixed by the two code reviews above [19:32:10] 7Puppet, 6operations, 6Labs: Labs puppet breaks for projects without a Hiera: page on wikitech - https://phabricator.wikimedia.org/T101913#1393877 (10RobH) [19:32:12] (03PS1) 10Ori.livneh: monitoring::service: add 'max_check_attempts' parameter [puppet] - 10https://gerrit.wikimedia.org/r/220277 [19:32:16] (03PS1) 10Rush: phab::migration => phab::tools with dump [puppet] - 10https://gerrit.wikimedia.org/r/220278 [19:33:27] (03CR) 10jenkins-bot: [V: 04-1] monitoring::service: add 'max_check_attempts' parameter [puppet] - 10https://gerrit.wikimedia.org/r/220277 (owner: 10Ori.livneh) [19:33:36] (03CR) 10Rush: "add fyi reviewers :)" [puppet] - 10https://gerrit.wikimedia.org/r/220278 (owner: 10Rush) [19:34:25] (03PS2) 10Ori.livneh: monitoring::service: add 'max_check_attempts' parameter [puppet] - 10https://gerrit.wikimedia.org/r/220277 [19:35:06] RECOVERY - puppet last run on ganeti2004 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:35:07] (03CR) 10jenkins-bot: [V: 04-1] 
monitoring::service: add 'max_check_attempts' parameter [puppet] - 10https://gerrit.wikimedia.org/r/220277 (owner: 10Ori.livneh) [19:35:29] (03PS3) 10Jdlrobson: Enable browse prototype on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) [19:36:09] (03Abandoned) 10Ori.livneh: monitoring::service: add 'max_check_attempts' parameter [puppet] - 10https://gerrit.wikimedia.org/r/220277 (owner: 10Ori.livneh) [19:37:20] (03CR) 10Rush: [C: 032] phab::migration => phab::tools with dump [puppet] - 10https://gerrit.wikimedia.org/r/220278 (owner: 10Rush) [19:38:30] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1393896 (10chasemp) [19:42:10] (03PS1) 10Ori.livneh: Add a retry for wikipedia check_http checks. [puppet] - 10https://gerrit.wikimedia.org/r/220283 [19:42:49] (03CR) 10Giuseppe Lavagetto: [C: 031] "The pybal::pool definition is a monument to all that is wrong in the puppet DSL - which of course is not your fault, kudos for being able " [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [19:43:09] (03CR) 10Ori.livneh: [C: 032] Use utils.sudo_check_call instead of subprocess.check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/220240 (owner: 10BryanDavis) [19:43:31] (03Merged) 10jenkins-bot: Use utils.sudo_check_call instead of subprocess.check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/220240 (owner: 10BryanDavis) [19:46:10] (03CR) 10Ori.livneh: [C: 04-1] Set --restart batch size to 5% of total hosts (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/220241 (owner: 10BryanDavis) [19:48:51] (03PS2) 10Ori.livneh: Set --restart batch size to 5% of total hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/220241 (owner: 10BryanDavis) [19:49:34] (03CR) 10Ori.livneh: [C: 032] Set --restart batch size to 5% of total hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/220241 (owner: 
10BryanDavis) [19:49:55] (03Merged) 10jenkins-bot: Set --restart batch size to 5% of total hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/220241 (owner: 10BryanDavis) [19:52:15] !log updated scap to master [19:52:20] Logged the message, Master [19:57:01] (03PS2) 10Ori.livneh: Add a retry for wikipedia check_http checks. [puppet] - 10https://gerrit.wikimedia.org/r/220283 [19:58:06] !log ori Started scap: (no message) [19:58:10] Logged the message, Master [19:59:52] !log ori scap failed: OSError [Errno 10] No child processes (duration: 01m 46s) [19:59:56] Logged the message, Master [20:00:37] (03PS3) 10Rush: Add a retry for wikipedia check_http checks. [puppet] - 10https://gerrit.wikimedia.org/r/220283 (owner: 10Ori.livneh) [20:00:46] (03CR) 10Rush: [C: 031] "Considering the odd middle ground we find ourselves in with scap / hhvm / pybal interaction this seems reasonable as an indicator for actu" [puppet] - 10https://gerrit.wikimedia.org/r/220283 (owner: 10Ori.livneh) [20:03:15] (03CR) 10Tim Landscheidt: "AOL to moving the configuration to Hiera. Also, multiple bastion hosts could and do exist, so HBA should work from all of them as well (i" [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) (owner: 10Merlijn van Deen) [20:06:51] 6operations: deploy nembus as ldap server in codfw - https://phabricator.wikimedia.org/T84751#1393994 (10Dzahn) [20:11:42] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1394015 (10Dzahn) 3NEW [20:11:55] andrewbogott: while I remember to ask, is there any ticket tracking deployment of a labs service in codfw? [20:12:06] been wondering about that for a while [20:12:51] JohnFLewis: no, we aren’t planning to add labs to codfw. 
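Editorial note: the change merged above sets the `--restart` batch size to 5% of total hosts. The log does not show how scap computes or rounds that figure, so the following is only a minimal sketch under stated assumptions: `restart_batch_size` and `batches` are hypothetical names, the result is rounded up, and at least one host is always restarted per batch.

```python
def restart_batch_size(total_hosts, percent=5):
    """Return percent% of total_hosts, rounded up, never less than 1.

    Integer arithmetic is used to avoid floating-point rounding surprises.
    The rounding rule and the floor of 1 are assumptions, not scap's
    documented behaviour.
    """
    return max(1, (total_hosts * percent + 99) // 100)


def batches(hosts, size):
    """Split a list of hosts into consecutive batches of at most `size`."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# The scap-test group in this log has 12 hosts (mw1170-mw1175, mw1270-mw1275).
test_hosts = ["mw%d" % n for n in list(range(1170, 1176)) + list(range(1270, 1276))]
size = restart_batch_size(len(test_hosts))
print(size, len(batches(test_hosts, size)))
```

Under these assumptions a 12-host group is restarted one host at a time, while a 200-host pool would go in batches of 10.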
[20:13:13] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1394029 (10Dzahn) p:5Triage>3Low priority low for now, should raise when date gets closer, but it's a monitoring critical now due to thresholds [20:13:36] really? I just assumed with the deployment of labs-y things to codfw like labstore and dns recursors [20:13:48] (03CR) 10Ori.livneh: [C: 032 V: 032] Add a retry for wikipedia check_http checks. [puppet] - 10https://gerrit.wikimedia.org/r/220283 (owner: 10Ori.livneh) [20:14:06] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1394032 (10Dzahn) [20:16:45] (03CR) 10Mattflaschen: [C: 032] Add Flow_test_talk namespace to en beta too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220188 (owner: 10Catrope) [20:16:53] (03Merged) 10jenkins-bot: Add Flow_test_talk namespace to en beta too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220188 (owner: 10Catrope) [20:17:16] ACKNOWLEDGEMENT - Certificate expiration on neptunium is CRITICAL: SSL CRITICAL - Certificate ldap-eqiad.wikimedia.org valid until 2015-09-20 19:41:02 +0000 (expires in 88 days) daniel_zahn https://phabricator.wikimedia.org/T103590#1394015 [20:17:57] ACKNOWLEDGEMENT - Certificate expiration on nembus is CRITICAL: SSL CRITICAL - Certificate ldap-codfw.wikimedia.org valid until 2015-09-20 19:36:03 +0000 (expires in 88 days) daniel_zahn https://phabricator.wikimedia.org/T103590 [20:18:10] wikiversions.json has unstaged changes on tin. [20:18:14] Stashing temporarily [20:19:05] !log mattflaschen Synchronized wmf-config/InitialiseSettings-labs.php: Beta-only change to add Flow_test to enwiki (duration: 00m 11s) [20:19:09] Logged the message, Master [20:20:20] andrewbogott: is there any reason why not or is it just 'not at the minute but maybe in the future' or? [20:23:35] twentyafterfour: wikiversions.json wasn't committed? 
^ [20:24:04] matt_flaschen: cool, "Unmerged changes on repository mediawiki_config" monitoring recovered [20:24:17] for some reason that did not show up on IRC [20:24:35] 6operations, 6Phabricator: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1394076 (10chasemp) a:5chasemp>3ArielGlenn Ok a job should run nightly and dump: > du -sh /srv/dumps/phabricator_public.dump > 115M /srv/dumps/phabricator_public.dump [20:24:53] legoktm: matt_flaschen: wikiversions.json update still pending, that was a temp change for testwiki [20:25:12] so no biggie that it was reverted [20:26:54] (03PS1) 1020after4: group0 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220295 [20:28:13] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220295 (owner: 1020after4) [20:28:19] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220295 (owner: 1020after4) [20:31:26] JohnFLewis: We couldn’t think of any real benefits. [20:32:29] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf11 [20:32:31] andrewbogott: and the labs{db|control|store} servers deployed in codfw will be used within the eqiad set up or? (sorry if I'm annoying but something I noticed and want to look into :) ) [20:32:34] Logged the message, Master [20:32:46] 6operations, 3Discovery-Cirrus-Sprint: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1394100 (10Manybubbles) Poking any opsen - can import? [20:35:02] JohnFLewis: yeah, they’re used as backups for eqiad systems. [20:35:14] Certainly in the case of file backups it’s useful to have them in a different datacenter. 
[20:35:24] okay :) [20:35:28] labcontrol could just as easily be in eqiad [20:36:14] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Sites, 10SEO, 7Mobile: Impossible to switch from mobile to desktop - https://phabricator.wikimedia.org/T103592#1394104 (10Jdlrobson) [20:37:06] 7Puppet, 5Patch-For-Review, 3Readership-Web, 3Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1394111 (10Jdlrobson) I believe this has caused a serious regression: T103592 [20:37:09] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Sites, 10SEO, 7Mobile: Impossible to switch from mobile to desktop - https://phabricator.wikimedia.org/T103592#1394113 (10Paladox) Seems to have been fixed but what ever it was needs to be backported to 1.26 wmf10 mediawiki was just now updated to... [20:37:22] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1394115 (10Niedzielski) 5Open>3Resolved @Dzahn, @Krenair, @RobH, ok that's a bummer but I can work around it. Thanks for the help! [20:37:25] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Sites, 10SEO, 7Mobile: Impossible to switch from mobile to desktop - https://phabricator.wikimedia.org/T103592#1394119 (10Jdlrobson) [20:37:56] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Sites, 10SEO, 7Mobile: Impossible to switch from mobile to desktop - https://phabricator.wikimedia.org/T103592#1394122 (10Paladox) Happens on wikimedia but doesn't happen now in mediawiki running on wmf 11. [20:39:10] 7Puppet, 5Patch-For-Review, 3Readership-Web, 3Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1383579 (10Jdlrobson) Not as serious as first thought - seems to only impact certain urls.. e.g. 
MediaWiki main page [20:39:49] jdlrobson: what's the thing with X______________ in the project name? [20:40:21] some sort of progress indication for the sprint? would it make sense for wikibugs to somehow remove it? [20:40:30] Puppet, Patch-For-Review, Readership-Web, Readership-Web-Next-Sprint-50-X______________: Certain urls do not redirect to mobile - https://phabricator.wikimedia.org/T103158#1394137 (Paladox) MediaWiki now works seems to be a bug in MobileFrontend on the wmf 10 branch because just a few minutes ago it... [20:41:40] operations, MediaWiki-General-or-Unknown, MediaWiki-Sites, SEO, Mobile: Impossible to switch from mobile to desktop on certain pages - https://phabricator.wikimedia.org/T103592#1394145 (Jdlrobson) [20:41:47] (PS3) Dzahn: Wikidata build: use deep copy instead of git submodule [puppet] - https://gerrit.wikimedia.org/r/219814 (owner: JanZerebecki) [20:42:11] operations, MediaWiki-General-or-Unknown, MediaWiki-Sites, SEO, Mobile: Impossible to switch from mobile to desktop on certain pages - https://phabricator.wikimedia.org/T103592#1394089 (Jdlrobson) Thanks for this report! It could be related to the title query string parameter in the url. [20:42:45] valhallasw: we use a project for sprints. That's the current sprint we are working in. It's a temporary 2 week project to capture work to do. [20:43:02] jdlrobson: yeah, that makes sense, but what's the X_____ for? :-) [20:43:32] it's because we haven't named it. It's supposed to be a movie beginning with X and we haven't come up with one yet ;) [20:43:41] aaaah :D [20:44:38] operations, Discovery-Cirrus-Sprint: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1394166 (fgiunchedi) @manybubbles, we can import 1.6 no problem, how long do you think it'll take to validate it? I'm asking because we'll be replacing 1.3 with 1.6 in the repo, thus...
[20:44:47] operations, ops-eqiad, RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1394169 (BBlack) I haven't followed all of the lead-up to this hardware deployment closely, but I was under the impression we already had a well-tested, reliable solution for hi... [20:44:54] (PS1) MaxSem: Restrict redirection regexes [puppet] - https://gerrit.wikimedia.org/r/220297 (https://phabricator.wikimedia.org/T103592) [20:44:57] jdlrobson: https://en.wikipedia.org/wiki/List_of_films:_X%E2%80%93Z I'm afraid there's not that much choice either :D [20:46:04] operations, Discovery-Cirrus-Sprint: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1394186 (Manybubbles) >>! In T102008#1394166, @fgiunchedi wrote: > @manybubbles, we can import 1.6 no problem, how long do you think it'll take to validate it? I'm asking because we'... [20:46:21] jdlrobson: X-Files > X-Men :) [20:48:17] (CR) Dzahn: [C: 2] Wikidata build: use deep copy instead of git submodule [puppet] - https://gerrit.wikimedia.org/r/219814 (owner: JanZerebecki) [20:48:19] (PS1) BryanDavis: Guard against https://bugs.python.org/issue1731717 [tools/scap] - https://gerrit.wikimedia.org/r/220300 [20:49:27] (PS1) Ori.livneh: Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 [20:49:47] (CR) jenkins-bot: [V: -1] Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 (owner: Ori.livneh) [20:50:19] ori: you need to import errno [20:50:59] right [20:51:18] (PS2) Ori.livneh: Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 [20:51:21] (Abandoned) BryanDavis: Guard against https://bugs.python.org/issue1731717 [tools/scap] - https://gerrit.wikimedia.org/r/220300 (owner: BryanDavis) [20:51:43] bd808: I'm sorry, I hadn't realized you already had a patch [20:51:50] I would have just amended or explained my idea
in words [20:52:00] no worries. it was just a couple of lines [20:52:34] If it was an artistic masterpiece I would have told you to wait ;) [20:52:51] waitpid() [20:53:15] operations, Discovery-Cirrus-Sprint: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1394212 (fgiunchedi) >>! In T102008#1394186, @Manybubbles wrote: >>>! In T102008#1394166, @fgiunchedi wrote: >> @manybubbles, we can import 1.6 no problem, how long do you think it'l... [20:54:41] (CR) Dzahn: [C: -1] "The first redirect target, https://doc.wikimedia.org/mediawiki-core/master/php/html gets a 404 Not Found ?" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/219228 (owner: Chad) [20:56:48] (PS2) MaxSem: Restrict redirection regexes [puppet] - https://gerrit.wikimedia.org/r/220297 (https://phabricator.wikimedia.org/T103592) [20:56:58] (CR) Dzahn: [C: 1] nova: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/219189 (owner: Matanya) [20:59:13] (PS3) BBlack: Restrict redirection regexes [puppet] - https://gerrit.wikimedia.org/r/220297 (https://phabricator.wikimedia.org/T103592) (owner: MaxSem) [20:59:13] mutante: yeh i think we are leaning on X Files... but we need to vote ;-) [21:00:51] jdlrobson: http://www.imdb.com/title/tt0104797/?ref_=fn_al_tt_1 [21:01:00] X, Malcolm [21:01:17] haha. not sure that's allowed [21:02:53] (CR) BBlack: [C: 2] Restrict redirection regexes [puppet] - https://gerrit.wikimedia.org/r/220297 (https://phabricator.wikimedia.org/T103592) (owner: MaxSem) [21:03:32] wow bblack and MaxSem super fast :) [21:05:28] (PS1) Dzahn: virt1000: remove from site.pp and DHCP [puppet] - https://gerrit.wikimedia.org/r/220304 (https://phabricator.wikimedia.org/T102005) [21:06:18] (PS1) Andrew Bogott: Turn on puppet autosigning on labs.
[puppet] - https://gerrit.wikimedia.org/r/220305 (https://phabricator.wikimedia.org/T102504) [21:06:20] (PS1) Andrew Bogott: Switch on salt auto_accept for labs. [puppet] - https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) [21:06:49] (PS2) Andrew Bogott: nova: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/219189 (owner: Matanya) [21:07:22] (CR) Andrew Bogott: [C: 2] nova: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/219189 (owner: Matanya) [21:07:32] thanks andrewbogott [21:07:38] (and mutante ) [21:08:00] same to you :) [21:09:11] (PS1) Hashar: contint: for Jessie s/ruby1.9.3/ruby2.1/ [puppet] - https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) [21:09:30] (CR) Andrew Bogott: "Since this is a rename rather than a decommission, it's easier to save this until the rename." [puppet] - https://gerrit.wikimedia.org/r/220304 (https://phabricator.wikimedia.org/T102005) (owner: Dzahn) [21:09:50] (CR) Hashar: "Then we can have some bundle jobs running ruby2.1 :-}" [puppet] - https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: Hashar) [21:10:27] (PS1) Dzahn: virtscripts: replace virt1000 with labcontrol1002 [puppet] - https://gerrit.wikimedia.org/r/220309 (https://phabricator.wikimedia.org/T1002005) [21:14:05] (PS1) John F. Lewis: add planet1001 as a VM [puppet] - https://gerrit.wikimedia.org/r/220310 (https://phabricator.wikimedia.org/T101730) [21:14:21] (PS2) John F.
Lewis: add planet1001 as a VM [puppet] - https://gerrit.wikimedia.org/r/220310 (https://phabricator.wikimedia.org/T101730) [21:14:23] (PS1) Dzahn: remove virt1000 [dns] - https://gerrit.wikimedia.org/r/220311 (https://phabricator.wikimedia.org/T1002005) [21:16:02] operations, MediaWiki-General-or-Unknown, MediaWiki-Sites, SEO, and 2 others: Impossible to switch from mobile to desktop on certain pages - https://phabricator.wikimedia.org/T103592#1394328 (Paladox) Open>Resolved [21:16:17] operations, MediaWiki-General-or-Unknown, MediaWiki-Sites, SEO, and 2 others: Impossible to switch from mobile to desktop on certain pages - https://phabricator.wikimedia.org/T103592#1394089 (Paladox) Problem now fixed. Wikipedia working again. [21:18:09] (PS3) BryanDavis: Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 (owner: Ori.livneh) [21:19:02] (PS2) Hashar: contint: for Jessie s/ruby1.9.3/ruby2.1/ [puppet] - https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) [21:19:16] (CR) BryanDavis: [C: 2] Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 (owner: Ori.livneh) [21:19:36] (Merged) jenkins-bot: Handle ECHILD in ssh.py [tools/scap] - https://gerrit.wikimedia.org/r/220301 (owner: Ori.livneh) [21:19:48] bd808: thanks! [21:22:42] (PS1) Dzahn: move public IP from virt1000 to labcontrol1002 [dns] - https://gerrit.wikimedia.org/r/220314 [21:23:07] PROBLEM - YARN NodeManager Node-State on analytics1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:36] (CR) Hashar: [C: 1 V: 2] "Forgot ruby1.9.1-dev which is only on Ubuntu.
Jessie receives ruby2.1-dev" [puppet] - https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: Hashar) [21:23:39] (PS2) Dzahn: move public IP from virt1000 to labcontrol1002 [dns] - https://gerrit.wikimedia.org/r/220314 [21:24:47] RECOVERY - YARN NodeManager Node-State on analytics1014 is OK YARN NodeManager analytics1014.eqiad.wmnet:8041 Node-State: RUNNING [21:25:04] operations, vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1394369 (JohnLewis) NEW [21:25:35] operations, vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1394377 (JohnLewis) [21:25:38] operations: VM for static-bugzilla - https://phabricator.wikimedia.org/T101734#1394376 (JohnLewis) [21:25:44] (PS1) BryanDavis: Ensure that the minimum batch size used by cluster_ssh is 1 [tools/scap] - https://gerrit.wikimedia.org/r/220316 [21:26:09] operations: Move static-bugzilla from zirconium to gantei - https://phabricator.wikimedia.org/T101734#1394379 (JohnLewis) [21:26:27] (CR) Dzahn: "to move the public IP, probably after the local IP/mgmt" [dns] - https://gerrit.wikimedia.org/r/220314 (owner: Dzahn) [21:30:17] (CR) Dzahn: [C: 2] "thanks John" [puppet] - https://gerrit.wikimedia.org/r/220310 (https://phabricator.wikimedia.org/T101730) (owner: John F.
Lewis) [21:30:36] PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures [21:31:08] (CR) Ori.livneh: [C: 2] Ensure that the minimum batch size used by cluster_ssh is 1 [tools/scap] - https://gerrit.wikimedia.org/r/220316 (owner: BryanDavis) [21:31:32] (Merged) jenkins-bot: Ensure that the minimum batch size used by cluster_ssh is 1 [tools/scap] - https://gerrit.wikimedia.org/r/220316 (owner: BryanDavis) [21:34:07] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:34:22] !log ori Synchronized php-1.26wmf11/extensions/SyntaxHighlight_GeSHi: 3c8bb2c493: Update SyntaxHighlight_GeSHi for cherry-pick (duration: 00m 13s) [21:34:27] Logged the message, Master [21:36:20] !log updated scap to 33f3002 (Ensure that the minimum batch size used by cluster_ssh is 1) [21:36:24] Logged the message, Master [21:36:55] ori: https://www.mediawiki.org/wiki/Manual:LocalSettings.php a bunch of that has no highlighting [21:37:10] because of style="overflow:auto;" ? [21:37:46] no, [21:37:50] it's using geshi html? [21:38:07] null edit worked [21:38:11] oh durr [21:38:15] have to cherry-pick the hook [21:38:24] or sync it, if you have already [21:39:03] oh, I thought the hook made it in [21:39:20] nope [21:40:20] !log legoktm Synchronized php-1.26wmf11/includes/parser/ParserCache.php: (no message) (duration: 00m 13s) [21:40:24] Logged the message, Master [21:40:26] !log starting instance planet1001 on ganeti1003 - cant get console [21:40:30] Logged the message, Master [21:41:11] hmm, still broken [21:42:22] example? [21:42:31] https://www.mediawiki.org/wiki/Manual:CommonSettings.php [21:42:39]