[00:00:00] /srv/mediawiki-staging/.git/refs/remotes/readonly/master [00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T0000). [00:00:09] ^ that there is always this one file owned by root [00:00:15] and i excluded it [00:00:37] uhm, already the next deploy before i got to gerrit :p heh [00:01:57] mutante: mukunda's out this week, I/we forgot to remove the standing entry from the calendar [00:02:08] ah:) ok [00:02:11] thanks greg-g [00:02:31] and on that note, time for me to head out [00:02:33] later [00:02:47] cya [00:05:13] (03PS3) 10Dzahn: Gerrit: Redirect plain "/r" (no trailing slash) to gerrit as well [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:05:33] (03CR) 10Dzahn: [C: 032] Gerrit: Redirect plain "/r" (no trailing slash) to gerrit as well [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:07:08] (03CR) 10Paladox: "We need this, we're trying to make it optional for labs." [puppet] - 10https://gerrit.wikimedia.org/r/302462 (owner: 10Chad) [00:07:36] ah yes there were some URL issues: https://wikitech.wikimedia.org/w/index.php?title=Module:Gerrit&diff=prev&oldid=787524 [00:08:02] /r/#q,,n,z didn't work anymore at upgrade time.
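[Editor's note] The "/r" redirect being merged above (change 301829) is the usual no-trailing-slash fixup. A hypothetical Apache sketch of that kind of rule follows; the actual rule lives in the Wikimedia operations/puppet repo, and the exact pattern and target here are assumptions:

```apache
# Hypothetical sketch only -- not the actual rule from operations/puppet.
# Send a bare "/r" to the canonical "/r/" Gerrit URL, leaving longer
# "/r/..." paths (which already worked) untouched.
RedirectMatch permanent ^/r$ /r/
```

Anchoring the pattern with `^` and `$` is what keeps the rule from also rewriting longer paths such as `/r/#q,,n,z`.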
[00:08:21] dereckson yep [00:08:37] that was a known problem and intentional from upstream [00:08:38] (and by this diff, I see AceEditor doesn't strip whitespaces at EOL by default) [00:09:03] (03CR) 10Dzahn: [C: 032] Gerrit: Attempt retaining logs for 10 days [puppet] - 10https://gerrit.wikimedia.org/r/301895 (owner: 10Chad) [00:09:14] (03PS2) 10Dzahn: Gerrit: Attempt retaining logs for 10 days [puppet] - 10https://gerrit.wikimedia.org/r/301895 (owner: 10Chad) [00:11:00] (03CR) 10Dzahn: "Leroy Jenkins, please stand up" [puppet] - 10https://gerrit.wikimedia.org/r/301895 (owner: 10Chad) [00:11:12] mutante ^^ lol [00:11:35] (03PS2) 10Dzahn: Gerrit: Rename apache logs a tad [puppet] - 10https://gerrit.wikimedia.org/r/301824 (owner: 10Chad) [00:11:49] (03CR) 10Dzahn: [C: 032] Gerrit: Rename apache logs a tad [puppet] - 10https://gerrit.wikimedia.org/r/301824 (owner: 10Chad) [00:12:29] (03Draft2) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [00:12:43] mutante i will try finishing that ^^ tomorrow [00:13:18] (03PS4) 10Dzahn: Gerrit: Redirect plain "/r" (no trailing slash) to gerrit as well [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:13:41] (03CR) 10Paladox: Gerrit: Redirect plain "/r" (no trailing slash) to gerrit as well [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:13:49] (03CR) 10Paladox: [C: 031] "needs a merge," [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:13:56] (03CR) 10Dzahn: [V: 032] Gerrit: Redirect plain "/r" (no trailing slash) to gerrit as well [puppet] - 10https://gerrit.wikimedia.org/r/301829 (owner: 10Chad) [00:17:13] !log gerrit is restarting for config change 301822 (set default project owners).
gerrit apache is restarting for 301829 (redirect /r) 301895 (logs for 10 days) and 301824 (renaming logs) [00:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:21] done [00:18:33] paladox: you can confirm the /r/ stuff if you want [00:18:47] Ok [00:18:49] i will now [00:19:11] Yeh it works [00:19:13] mutante ^ [00:19:20] paladox: the ratelimit for Letsencrypt and wmflabs.org .. i think you are running into that because of Krenair's work on beta earlier [00:19:27] and yea, tomorrow [00:19:28] Oh [00:19:33] yep [00:19:37] it is 01:19am [00:19:41] ok, good about the redirect [00:19:42] yes [00:19:47] Yep [00:19:48] * paladox going to watch tv now. [00:19:54] ok, laters [00:19:57] ok [00:21:18] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: Puppet has 1 failures [00:22:30] mutante, might be because of both of us [00:22:35] I thought I ran into a slightly different limit [00:23:17] he is getting [00:23:20] Too many certificates already issued for: wmflabs.org", [00:23:28] either we make it optional in labs [00:23:34] or we use their staging server [00:23:35] i guess [00:30:47] oh right, the grrrit-wm.. 
yes yes [00:37:12] (03CR) 10Dzahn: [C: 032] tcpircbot: add rhodium to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/302621 (owner: 10Dzahn) [00:37:18] (03PS4) 10Dzahn: tcpircbot: add rhodium to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/302621 [00:38:07] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Puppet has 25 failures [00:45:59] (03PS3) 10Dzahn: installserver: move role to module [puppet] - 10https://gerrit.wikimedia.org/r/298907 [00:47:07] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:01:10] (03PS4) 10Dzahn: installserver: move role to module [puppet] - 10https://gerrit.wikimedia.org/r/298907 [01:04:35] (03PS5) 10Dzahn: installserver: move role to module [puppet] - 10https://gerrit.wikimedia.org/r/298907 [01:04:42] mutante: yeah that ratelimit is for total unique certs (unique sets of SANs), where any cert has any of its SANs in wmflabs.org. which is limited to 20 unique certs/week. [01:05:22] but basically if you're doing things right-ish, there's really no reason to ever hit any of the ratelimits. mostly they get hit by constantly re-issuing because you're testing your software setup, which the staging server is more-ideal for. [01:05:57] yes, i saw your comment earlier about possibly adding that to puppet.. a mode to use the staging server [01:06:05] although labs instances with their own individual public hostnames and certs...
that's a strange case, as they will get limited to 20/week even though they're 'unrelated' to each other from our POV [01:06:35] it would be better to manage them centrally in a proxy with big SAN lists, probably [01:06:37] also, we can either get a public IP for gerrit in labs [01:06:51] and use LE there, with real certs [01:06:57] gerrit.wmflabs or whatnot [01:07:03] .org [01:07:07] LE won't do private domains [01:07:11] or we can click the proxy thing in wikitech [01:07:24] and skip LE altogether [01:07:50] but then there are more differences between prod and labs [01:08:00] where the point was to actually use the prod class [01:08:09] yeah [01:08:48] if we use LE with the staging server, we won't get real valid certs, right [01:08:55] will we get no certs [01:09:06] i mean.. will apache start [01:14:11] (03PS6) 10Dzahn: installserver: move role to module [puppet] - 10https://gerrit.wikimedia.org/r/298907 [01:14:50] (03CR) 10Aaron Schulz: Increase retries for rename jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731) (owner: 10Gergő Tisza) [01:18:22] (03PS7) 10Dzahn: installserver: move role to module [puppet] - 10https://gerrit.wikimedia.org/r/298907 [01:19:44] ostriches: around? [01:21:48] (03CR) 10Dzahn: "@Faidon i could swear this was a limitation of the role keyword and there was definitely a reason in the past. 
Also note how there is not " [puppet] - 10https://gerrit.wikimedia.org/r/298907 (owner: 10Dzahn) [01:24:38] (03CR) 10Dzahn: "@Faidon the issue is described on https://phabricator.wikimedia.org/T119042 , specifically https://phabricator.wikimedia.org/T119042#20616" [puppet] - 10https://gerrit.wikimedia.org/r/298907 (owner: 10Dzahn) [01:26:13] (03CR) 10Dzahn: "the actual issue and reason to rename it is this https://phabricator.wikimedia.org/T119042#2061646" [puppet] - 10https://gerrit.wikimedia.org/r/298911 (owner: 10Dzahn) [01:47:56] krenair@labtestcontrol2001:~$ df -h | grep 100% [01:48:02] /dev/md0 9.1G 8.6G 28M 100% / [01:48:50] this seems to be causing rabbitmq to be unhappy [01:52:08] okay... I know who to have a chat with about this [02:10:36] (03PS2) 10Gergő Tisza: Increase retries for rename jobs [puppet] - 10https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731) [02:16:36] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is OK: Files ownership is ok. 
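[Editor's note] bblack's rate-limit explanation above (Let's Encrypt allows a limited number of certificates per registered domain per rolling week, 20 at the time, which the wmflabs.org testing kept exhausting) can be made concrete with a toy check. The function and constants below are illustrative only and not part of any ACME client:

```python
from datetime import datetime, timedelta

RATE_LIMIT = 20             # certs per registered domain (value from the discussion)
WINDOW = timedelta(days=7)  # rolling one-week window

def would_exceed_limit(issued_at, now, limit=RATE_LIMIT, window=WINDOW):
    """Return True if issuing one more certificate now would exceed the
    per-registered-domain rate limit, given past issuance timestamps."""
    recent = [t for t in issued_at if now - t < window]
    return len(recent) + 1 > limit
```

This is also why the staging endpoint matters for testing: repeated re-issuance during development counts against the same rolling window on the production API.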
[02:24:20] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.12) (duration: 08m 39s) [02:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:44] (03CR) 10Aaron Schulz: [C: 031] Increase retries for rename jobs [puppet] - 10https://gerrit.wikimedia.org/r/302650 (https://phabricator.wikimedia.org/T141731) (owner: 10Gergő Tisza) [02:36:28] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [02:36:53] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.13) (duration: 05m 52s) [02:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 4 02:43:16 UTC 2016 (duration 6m 23s) [02:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:18:28] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: puppet fail [03:41:46] (03PS2) 10Alex Monk: labs dnsrecursor metaldns: Don't return NXDOMAIN when we don't have a record of the right type but do recognise the domain [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) [03:42:44] (03CR) 10BBlack: [C: 031] labs dnsrecursor metaldns: Don't return NXDOMAIN when we don't have a record of the right type but do recognise the domain [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [03:45:57] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:48:57] (03PS3) 10Alex Monk: labs dnsrecursor metaldns: Don't return NXDOMAIN when we don't have a record of the right type but do recognise the domain [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) [03:57:19] (03PS4) 10Alex Monk: labs dnsrecursor metaldns: Don't return NXDOMAIN when we 
don't have a record of the right type but do recognise the domain [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) [05:47:58] (03PS1) 10Dzahn: planet: add mapped IPv6 address on VMs [puppet] - 10https://gerrit.wikimedia.org/r/302862 [05:54:04] (03PS1) 10Dzahn: phab2001: add IPv6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302865 [05:59:03] (03PS1) 10Dzahn: zosma: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/302866 [06:09:15] (03PS1) 10Dzahn: alsafi: add missing IPV6 AAAA and reverse [dns] - 10https://gerrit.wikimedia.org/r/302868 [06:10:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [06:11:07] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [06:12:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [06:20:48] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [06:21:56] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [06:21:57] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:25:29] (03PS1) 10Dzahn: delete strategyapps.wikimedia.org [dns] - 
10https://gerrit.wikimedia.org/r/302870 (https://phabricator.wikimedia.org/T31675) [06:30:18] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:08] (03PS1) 10Dzahn: delete videos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/302873 [06:32:57] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:16] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:38] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:07] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:27] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: puppet fail [06:37:07] icinga-wm: no [06:37:26] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: puppet fail [06:38:17] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:38:36] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [06:39:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [06:39:47] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps 
wave]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [06:41:06] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2521553 (10ema) @madhuvishy: your PGP key does not seem to be signed yet: https://wikitech.wikimedia.org/wiki/PGP_Keys#Signing_keys. Ping me if you want... [06:54:08] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [06:55:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [06:55:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:56:17] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:56:37] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:57] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:57:27] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:57:37] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data 
above the critical threshold [50.0] [07:04:57] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:06:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [07:13:47] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/3572/ says 305 NOOP. The 5 that fail are due to labs/private issues. Will fix in another commit" [puppet] - 10https://gerrit.wikimedia.org/r/302741 (owner: 10Alexandros Kosiaris) [07:15:15] (03CR) 10Alexandros Kosiaris: "@Chad. Indeed. Thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/302741 (owner: 10Alexandros Kosiaris) [07:15:57] (03PS2) 10Alexandros Kosiaris: hiera: Remove 2 items from the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/302750 [07:15:59] (03PS2) 10Alexandros Kosiaris: realm: Move the $::site setting code first [puppet] - 10https://gerrit.wikimedia.org/r/302741 [07:16:42] (03CR) 10Alexandros Kosiaris: [C: 032] realm: Move the $::site setting code first [puppet] - 10https://gerrit.wikimedia.org/r/302741 (owner: 10Alexandros Kosiaris) [07:16:55] (03PS3) 10Alexandros Kosiaris: realm: Move the $::site setting code first [puppet] - 10https://gerrit.wikimedia.org/r/302741 [07:17:02] (03CR) 10Alexandros Kosiaris: [V: 032] realm: Move the $::site setting code first [puppet] - 10https://gerrit.wikimedia.org/r/302741 (owner: 10Alexandros Kosiaris) [07:24:57] (03PS1) 10Alexandros Kosiaris: Revert "rhodium: add IPv6 AAAA and reverse" [dns] - 10https://gerrit.wikimedia.org/r/302878 [07:26:17] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "rhodium: add IPv6 AAAA and reverse" [dns] - 10https://gerrit.wikimedia.org/r/302878 (owner: 10Alexandros Kosiaris) [07:26:53] (03CR) 10Alexandros Kosiaris: "reverted in https://gerrit.wikimedia.org/r/#/c/302878/" [dns] - 10https://gerrit.wikimedia.org/r/302772 (owner: 10Dzahn) [07:30:27] PROBLEM - puppet last run on mw1177 
is CRITICAL: CRITICAL: puppet fail [07:30:28] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1008.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:30:29] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1009.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:01] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1008.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:31:02] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1009.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:31:03] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1010.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:31:04] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1011.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:31:07] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp1012.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [07:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:34:47] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2521610 (10ema) New test performed this morning: sudo varnishd -a :81 -f /var/tmp/frontend-v4.vcl -F -n frontend sudo rm /var/tmp/varnish.main1 ; sudo varnishd -a :3128 -b localho... 
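[Editor's note] Stepping back to the labs dnsrecursor change above (r/299903, "Don't return NXDOMAIN when we don't have a record of the right type but do recognise the domain"): the point is the NXDOMAIN-vs-NODATA distinction, which a toy resolver makes concrete. The zone data and names here are entirely made up:

```python
# NXDOMAIN means the *name* is unknown; NODATA (NOERROR with an empty
# answer) means the name exists but has no record of the requested type.
# Returning NXDOMAIN for the latter wrongly tells clients the whole name
# is nonexistent, which is what the patch avoids.
ZONE = {  # hypothetical data: name -> {rrtype: value}
    "gerrit.example.wmflabs": {"A": "10.0.0.5"},
}

def answer(name, rrtype):
    if name not in ZONE:
        return ("NXDOMAIN", None)   # name entirely unknown
    value = ZONE[name].get(rrtype)
    if value is None:
        return ("NOERROR", None)    # NODATA: name exists, type doesn't
    return ("NOERROR", value)
```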
[07:35:06] (03PS1) 10Alexandros Kosiaris: wtp10XX: Set all installers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/302880 (https://phabricator.wikimedia.org/T135176) [07:35:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] wtp10XX: Set all installers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/302880 (https://phabricator.wikimedia.org/T135176) (owner: 10Alexandros Kosiaris) [07:37:57] (03PS1) 10Alexandros Kosiaris: Revert "Revert "rhodium: add IPv6 AAAA and reverse"" [dns] - 10https://gerrit.wikimedia.org/r/302881 [07:38:38] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 147 not-conn: cp2001_v6 [07:39:50] (03PS1) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 [07:41:44] (03CR) 10jenkins-bot: [V: 04-1] Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 (owner: 10Giuseppe Lavagetto) [07:43:09] (03PS1) 10Alexandros Kosiaris: Add various missing secrets [labs/private] - 10https://gerrit.wikimedia.org/r/302883 [07:43:54] (03CR) 10Alexandros Kosiaris: "Re-Reversion proposed in https://gerrit.wikimedia.org/r/#/c/302881/ so we don't forget about it. 
But it's not yet mergeable" [dns] - 10https://gerrit.wikimedia.org/r/302878 (owner: 10Alexandros Kosiaris) [07:44:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 to block until prerequisite in commit message is addressed" [dns] - 10https://gerrit.wikimedia.org/r/302881 (owner: 10Alexandros Kosiaris) [07:44:56] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add various missing secrets [labs/private] - 10https://gerrit.wikimedia.org/r/302883 (owner: 10Alexandros Kosiaris) [07:46:59] (03PS1) 10Jcrespo: Use mariadb::analytics for the special host db1047 [puppet] - 10https://gerrit.wikimedia.org/r/302884 [07:47:27] (03PS2) 10Jcrespo: Use mariadb::analytics for the special host db1047 [puppet] - 10https://gerrit.wikimedia.org/r/302884 [07:51:35] (03CR) 10Jcrespo: [C: 032] Use mariadb::analytics for the special host db1047 [puppet] - 10https://gerrit.wikimedia.org/r/302884 (owner: 10Jcrespo) [07:53:17] I'm having issues on puppet deploy [07:54:14] it has been deployed to palladium, not so sure about strontium or rhodium [07:54:16] PROBLEM - configured eth on wtp1008 is CRITICAL: Connection refused by host [07:54:36] PROBLEM - MD RAID on wtp1008 is CRITICAL: Connection refused by host [07:54:37] PROBLEM - dhclient process on wtp1008 is CRITICAL: Connection refused by host [07:54:47] PROBLEM - parsoid on wtp1008 is CRITICAL: Connection refused [07:54:57] PROBLEM - dhclient process on wtp1011 is CRITICAL: Connection refused by host [07:54:57] PROBLEM - configured eth on wtp1009 is CRITICAL: Connection refused by host [07:54:57] PROBLEM - puppet last run on wtp1009 is CRITICAL: Connection refused by host [07:54:57] PROBLEM - Check size of conntrack table on wtp1008 is CRITICAL: Connection refused by host [07:55:03] you can puppet-merge on those two hosts if you're not sure your changeset made it over [07:55:09] if that's what you are talking about [07:55:17] PROBLEM - Check size of conntrack table on wtp1010 is CRITICAL: Connection refused by host [07:55:17] PROBLEM - Check
size of conntrack table on wtp1012 is CRITICAL: Connection refused by host [07:55:18] PROBLEM - configured eth on wtp1011 is CRITICAL: Connection refused by host [07:55:26] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: Connection refused by host [07:55:26] PROBLEM - salt-minion processes on wtp1009 is CRITICAL: Connection refused by host [07:55:26] PROBLEM - dhclient process on wtp1012 is CRITICAL: Connection refused by host [07:55:26] PROBLEM - DPKG on wtp1011 is CRITICAL: Connection refused by host [07:55:27] PROBLEM - dhclient process on wtp1009 is CRITICAL: Connection refused by host [07:55:27] PROBLEM - Disk space on wtp1008 is CRITICAL: Connection refused by host [07:55:27] PROBLEM - DPKG on wtp1008 is CRITICAL: Connection refused by host [07:55:28] PROBLEM - MD RAID on wtp1009 is CRITICAL: Connection refused by host [07:55:28] PROBLEM - Disk space on wtp1010 is CRITICAL: Connection refused by host [07:55:36] PROBLEM - parsoid on wtp1012 is CRITICAL: Connection refused [07:55:36] PROBLEM - puppet last run on wtp1010 is CRITICAL: Connection refused by host [07:55:37] PROBLEM - DPKG on wtp1012 is CRITICAL: Connection refused by host [07:55:37] PROBLEM - puppet last run on wtp1008 is CRITICAL: Connection refused by host [07:55:37] PROBLEM - salt-minion processes on wtp1011 is CRITICAL: Connection refused by host [07:55:37] PROBLEM - salt-minion processes on wtp1008 is CRITICAL: Connection refused by host [07:55:38] PROBLEM - salt-minion processes on wtp1010 is CRITICAL: Connection refused by host [07:55:38] PROBLEM - Disk space on wtp1012 is CRITICAL: Connection refused by host [07:55:45] akosiaris: those are you I guess? 
[07:55:47] PROBLEM - MD RAID on wtp1010 is CRITICAL: Connection refused by host [07:55:47] PROBLEM - Disk space on wtp1011 is CRITICAL: Connection refused by host [07:55:57] PROBLEM - Check size of conntrack table on wtp1009 is CRITICAL: Connection refused by host [07:55:57] PROBLEM - parsoid on wtp1009 is CRITICAL: Connection refused [07:56:06] PROBLEM - Disk space on wtp1009 is CRITICAL: Connection refused by host [07:56:07] PROBLEM - DPKG on wtp1009 is CRITICAL: Connection refused by host [07:56:12] apergos: yeah I think he is rebooting [07:56:17] PROBLEM - DPKG on wtp1010 is CRITICAL: Connection refused by host [07:56:17] PROBLEM - configured eth on wtp1010 is CRITICAL: Connection refused by host [07:56:18] PROBLEM - Check size of conntrack table on wtp1011 is CRITICAL: Connection refused by host [07:56:24] it has, it was only very slow [07:56:26] PROBLEM - MD RAID on wtp1012 is CRITICAL: Connection refused by host [07:56:26] PROBLEM - MD RAID on wtp1011 is CRITICAL: Connection refused by host [07:56:26] PROBLEM - puppet last run on wtp1011 is CRITICAL: Connection refused by host [07:56:36] PROBLEM - parsoid on wtp1011 is CRITICAL: Connection refused [07:56:36] PROBLEM - dhclient process on wtp1010 is CRITICAL: Connection refused by host [07:56:38] PROBLEM - puppet last run on wtp1012 is CRITICAL: Connection refused by host [07:56:46] PROBLEM - parsoid on wtp1010 is CRITICAL: Connection refused [07:56:56] PROBLEM - configured eth on wtp1012 is CRITICAL: Connection refused by host [07:57:23] apergos: yes that's me [07:57:28] damn I forgot icinga [07:57:31] okey dokey [07:57:39] icinga didn't forget you! 
[07:57:47] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:58:04] jynus: it was https://gerrit.wikimedia.org/r/#/c/302881/ [07:58:06] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [07:58:23] TL;DR: IPv6 record added on rhodium but no ferm access [07:58:41] I've reverted [08:02:07] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:09:14] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [08:09:31] akosiaris, it timed out? then why did it work? I am confused [08:11:04] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [08:13:06] jynus: fallback from IPv6 to IPv4 [08:13:22] I didn't know that was even possible! [08:13:33] it's the standard process [08:13:50] I am happily surprised [08:14:03] used to be that whenever a site was not loading and then after 13 secs it would load I would say "it's IPv6!" [08:14:36] and I was always right.
ofc happy eyeballs came then and browsers started being better [08:15:08] but most other software relies on the standard behavior and not happy eyeballs [08:32:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [08:40:18] 06Operations, 10Traffic: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521768 (10ema) [08:42:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [08:44:44] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 26 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:50:35] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 12 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [08:54:03] (03PS1) 10Jcrespo: Remove /a mount point exception for old parsercache servers [puppet] - 10https://gerrit.wikimedia.org/r/302889 [08:55:27] (03CR) 10Jcrespo: [C: 032] Remove /a mount point exception for old parsercache servers [puppet] - 10https://gerrit.wikimedia.org/r/302889 (owner: 10Jcrespo) [08:59:51] (03PS14) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:07:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [09:11:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [09:18:00] (03PS15) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:18:51] (03CR) 10Filippo Giunchedi: "overall LGTM, mispelled class name" (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:21:27] (03PS16) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:21:29] (03CR) 10Volans: [C: 031] "The salt part looks good, just a minor styling comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:23:00] (03PS17) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:29:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [09:30:45] 06Operations, 10MediaWiki-Database: periodic spike of MW exceptions "DB connection was already closed or the connection dropped." - https://phabricator.wikimedia.org/T142079#2521846 (10fgiunchedi) [09:31:04] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.107, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:31:24] PROBLEM - AQS root url on aqs1004 is CRITICAL: Connection refused [09:31:34] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: Connection refused [09:31:35] This is me bootstrapping new node, the downtime that I've set yesterday expired [09:31:38] sorry [09:32:17] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:33:55] (03PS18) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 
10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:36:00] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:39:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [09:39:51] (03PS19) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:43:46] (03PS1) 10DCausse: mwgrep: fails gracefully when an invalid regex is provided [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) [09:45:04] (03CR) 10jenkins-bot: [V: 04-1] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:45:26] (03PS1) 10Gehel: Maps - missing option in initial import script [puppet] - 10https://gerrit.wikimedia.org/r/302893 (https://phabricator.wikimedia.org/T138092) [09:46:38] (03PS20) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:46:50] (03CR) 10Gehel: [C: 032] Maps - missing option in initial import script [puppet] - 10https://gerrit.wikimedia.org/r/302893 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [09:49:30] (03CR) 10Filippo Giunchedi: [C: 031] Add prometheus's mysql-exporter to all jessie's production dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:51:49] 06Operations, 10Traffic: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521872 (10ema) [09:51:55] (03PS1) 10Gehel: Maps - typo in 
initial import script [puppet] - 10https://gerrit.wikimedia.org/r/302894 (https://phabricator.wikimedia.org/T138092) [09:53:07] (03CR) 10Gehel: [C: 032] Maps - typo in initial import script [puppet] - 10https://gerrit.wikimedia.org/r/302894 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [09:56:45] (03PS21) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production codfw dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:56:51] (03PS22) 10Jcrespo: Add prometheus's mysql-exporter to all jessie's production codfw dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) [09:58:21] (03CR) 10Addshore: [C: 031] UrlShortener: Whitelist *.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302851 (https://phabricator.wikimedia.org/T142055) (owner: 10Legoktm) [10:00:20] (03CR) 10Jcrespo: [C: 032] Add prometheus's mysql-exporter to all jessie's production codfw dbs [puppet] - 10https://gerrit.wikimedia.org/r/302680 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:01:48] I'm running puppet on db2069 [10:02:45] also on db1057 [10:06:44] RECOVERY - HTTPS-policy on policy.wikimedia.org is OK: SSL OK - Certificate policy.wikimedia.org valid until 2017-09-27 16:01:01 +0000 (expires in 419 days) [10:10:27] the salt part is working nicely (although not yet applied to all hosts) [10:10:43] (03PS1) 10Addshore: wmgEchoMentionStatusNotifications true for test/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302898 (https://phabricator.wikimedia.org/T141995) [10:16:45] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Puppet has 1 failures [10:17:12] interesting [10:18:10] godog: Execution of '/usr/sbin/service prometheus-mysqld-exporter start' returned 1: Job for prometheus-mysqld-exporter.service failed. 
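The mwgrep patch flagged at 09:43:46 (302892, "fails gracefully when an invalid regex is provided") is about validating a user-supplied pattern up front instead of blowing up mid-search. mwgrep itself runs its regexes server-side against Elasticsearch, so the following Python `re` version is only an analogy of the idea, and `safe_compile` is a name of my own invention:

```python
import re

def safe_compile(pattern):
    """Compile a user-supplied regex, returning None (with a message)
    instead of raising, so a bad pattern aborts cleanly before any
    expensive work starts."""
    try:
        return re.compile(pattern)
    except re.error as err:
        print("invalid regex %r: %s" % (pattern, err))
        return None
```

A caller can then bail out early with a usable error message rather than surfacing a traceback to the operator.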
[10:18:33] hah, interesting indeed [10:18:42] I'm looking as well [10:22:43] jynus: I've bounced mysqld-exporter, I think it might be because of a missing dependency on the .my.cnf file [10:23:26] what? [10:23:44] .my.cnf does not have dependencies [10:23:59] oh, you mean prometheus's mysql config? [10:24:56] but I think it is node exporter what failed, not mysqld-exporter? [10:25:38] in the line above mysqld-exporter failed when first installed by puppet, it didn't even start [10:26:17] (03PS2) 10DCausse: mwgrep: fails gracefully when an invalid regex is provided [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) [10:27:13] (03PS1) 10Filippo Giunchedi: prometheus: fix mysqld-exporter service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/302901 [10:27:24] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 1 failures [10:30:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [10:33:03] !log upgrading httpd on mw126[56] to 2.4.10-10+deb8u4+wmf3 (T73487) [10:33:04] T73487: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487 [10:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:54] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 1 failures [10:37:26] (03CR) 10Jcrespo: [C: 031] "Didn't I say never to trust puppet? 
:-)" [puppet] - 10https://gerrit.wikimedia.org/r/302901 (owner: 10Filippo Giunchedi) [10:38:19] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix mysqld-exporter service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/302901 (owner: 10Filippo Giunchedi) [10:38:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:40:54] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:42:34] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:51] just noticed the mw exceptions, but I started the upgrade a bit after the log, so surely not related to my work [10:48:46] errors seems the same ones that ostriches identified yesterday [10:49:51] elukey: yeah I've opened https://phabricator.wikimedia.org/T142079 to track it [10:50:02] silencing the alarm is tricky though as it might mask other issues [10:50:48] 06Operations, 07Puppet, 13Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#1816403 (10faidon) Why can't we just move all the roles from manifests/roles into modules/roles in one single commit and be done with... [10:52:16] making an alarm based on mediawiki exceptions is useless [10:53:05] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:53:35] how so? [10:57:36] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter for host monitoring - https://phabricator.wikimedia.org/T140646#2521943 (10fgiunchedi) more db hosts have been added today, though the `mdadm` collector fails to parse raid0 arrays since the 0.12.0 release is m... 
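The fix merged at 10:38 (302901, "fix mysqld-exporter service dependencies") suggests the first agent run tried to start the exporter before its MySQL credentials file existed. A hypothetical shape of such an ordering fix follows; the resource names, paths, and template are illustrative, not the actual patch:

```puppet
# Illustrative only: make sure the exporter's client config is in
# place before the service is started, so the first puppet run on a
# fresh host succeeds instead of failing as seen on pc2004/es2016.
file { '/var/lib/prometheus/.my.cnf':
    ensure  => present,
    owner   => 'prometheus',
    mode    => '0400',
    content => template('prometheus/my.cnf.erb'),
}

service { 'prometheus-mysqld-exporter':
    ensure  => running,
    require => File['/var/lib/prometheus/.my.cnf'],
}
```

Without the explicit `require`, Puppet is free to start the service in the same run but before the file resource is realized, which matches the one-off failures that recovered on the next run.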
[10:58:45] (03PS3) 10Alexandros Kosiaris: hiera: Remove 2 items from the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/302750 [10:59:24] !log applying prometheus grants to pc* hosts [10:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:43] godog, most of those are harmless (they should not happen, but they do) [11:03:58] sorry, godog, wrong person [11:04:39] 06Operations, 10Traffic: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521949 (10ema) I've collected 30 minutes of frontend GET requests on cp1048 as follows: varnishncsa -m 'RxRequest:GET' -F '%{Range}i %{Content-Length}o %r' -n frontend Out of 1029692 total... [11:04:49] most jobs do not handle connections right, so they just do things with a single connection until the connection closes [11:05:14] not a big deal because they are reopened or the jobs are retried [11:05:37] but they should not be on a single, long running connection in the first place [11:05:48] if the logs help fixing that, they are great [11:07:02] anyway, it is mediawiki, I should not have an opinion about that, so whatever it works [11:07:13] (03Abandoned) 10ArielGlenn: capture dumps cron job output in log and add log rotation [puppet] - 10https://gerrit.wikimedia.org/r/302827 (owner: 10ArielGlenn) [11:07:31] although it affects me because it makes failovers more complicated [11:08:44] I wouldn't discard that being related to the transaction handling issue: T140955 [11:08:44] T140955: Wikibase\Repo\Store\WikiPageEntityStore::updateWatchlist: Automatic transaction with writes in progress (from DatabaseBase::query (LinkCache::addLinkObj)), performing implicit commit! 
- https://phabricator.wikimedia.org/T140955 [11:10:15] in terms of alerts- I think there should only be alerts if interactive queries fail, the rest should be on a graph [11:13:57] (03PS1) 10Filippo Giunchedi: prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) [11:14:24] I agree, if it isn't user-facing exceptions it doesn't need an alert, or at least not at the same threshold [11:15:05] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) (owner: 10Filippo Giunchedi) [11:17:29] (03PS2) 10Filippo Giunchedi: prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) [11:20:15] (03PS3) 10Filippo Giunchedi: prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) [11:21:45] (03PS4) 10Filippo Giunchedi: prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) [11:22:53] (03PS4) 10Alexandros Kosiaris: hiera: Remove 2 items from the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/302750 [11:24:26] (03PS1) 10Alexandros Kosiaris: Move labtest hiera into host specific configs [labs/private] - 10https://gerrit.wikimedia.org/r/302905 [11:25:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move labtest hiera into host specific configs [labs/private] - 10https://gerrit.wikimedia.org/r/302905 (owner: 10Alexandros Kosiaris) [11:27:31] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add mysql job configuration [puppet] - 10https://gerrit.wikimedia.org/r/302904 (https://phabricator.wikimedia.org/T126757) (owner: 10Filippo Giunchedi) [11:34:03] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.008 second response time on port 9042 
[11:43:27] this is me --^ [11:47:07] (03PS5) 10Alexandros Kosiaris: hiera: Remove 2 items from the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/302750 [11:52:10] 06Operations, 10Traffic: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521992 (10faidon) Note that for some media file types, such as Ogg (the container format), it's impossible to know the file's duration in seconds from the header of the file. Browsers that want... [12:06:49] (03PS6) 10Alexandros Kosiaris: hiera: -2/+1 hierarchy items, labtest realm changes [puppet] - 10https://gerrit.wikimedia.org/r/302750 [12:07:23] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] hiera: -2/+1 hierarchy items, labtest realm changes [puppet] - 10https://gerrit.wikimedia.org/r/302750 (owner: 10Alexandros Kosiaris) [12:07:29] (03PS7) 10Alexandros Kosiaris: hiera: -2/+1 hierarchy items, labtest realm changes [puppet] - 10https://gerrit.wikimedia.org/r/302750 [12:07:33] (03CR) 10Alexandros Kosiaris: [V: 032] hiera: -2/+1 hierarchy items, labtest realm changes [puppet] - 10https://gerrit.wikimedia.org/r/302750 (owner: 10Alexandros Kosiaris) [12:10:10] (03PS1) 10Alexandros Kosiaris: puppetmaster: Enable trusted_node_data for puppet 3.7+ [puppet] - 10https://gerrit.wikimedia.org/r/302907 [12:13:49] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Enable trusted_node_data for puppet 3.7+ [puppet] - 10https://gerrit.wikimedia.org/r/302907 (owner: 10Alexandros Kosiaris) [12:21:18] (03PS1) 10Alexandros Kosiaris: realm: vary $main_address generation on puppetmaster version [puppet] - 10https://gerrit.wikimedia.org/r/302909 [12:26:47] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2522057 (10Jgreen) @patrick before the test wikipedia.org had no SPF record, and current record tracks wikimedia.org: "v=s... 
[12:27:53] (03PS1) 10ArielGlenn: fix up pgrep error check for rsyncs between dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/302910 [12:30:21] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2522058 (10abian) Another 503 error, this one with an abnormally large diff (+1 784 586 bytes) on [[https://www.wikidata.org/w/index.php?title=Wikidata:Database_r... [12:33:48] (03PS2) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [12:34:10] (03CR) 10jenkins-bot: [V: 04-1] Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 (owner: 10ArielGlenn) [12:39:50] (03PS3) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:40:44] (03PS4) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:42:00] (03CR) 10jenkins-bot: [V: 04-1] Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [12:42:23] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [12:43:53] (03PS5) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:44:58] (03CR) 10jenkins-bot: [V: 04-1] Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [12:45:58] (03PS2) 10ArielGlenn: If a prereq job is missing, optionally run it instead of giving up [dumps] - 10https://gerrit.wikimedia.org/r/302706 (https://phabricator.wikimedia.org/T141981) [12:46:06] (03PS6) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:51:59] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - 
https://phabricator.wikimedia.org/T135410#2522087 (10faidon) >>! In T135410#2522057, @Jgreen wrote: > "v=spf1 include:wikimedia.org ?all" > > We probably want a pol... [12:56:52] (03PS7) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:57:22] (03PS8) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [12:57:30] (03PS3) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [12:59:13] (03CR) 10Faidon Liambotis: [C: 04-1] "Why change the semantics of $dump? Just keep it a boolean and vary the ensure parameter based on it instead?" [puppet] - 10https://gerrit.wikimedia.org/r/302271 (owner: 10Chad) [13:03:27] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [13:08:55] (03PS9) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [13:17:49] 06Operations: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2522106 (10Volans) [13:26:54] (03PS4) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [13:40:35] (03PS1) 10Filippo Giunchedi: prometheus: remove redundant rules checking [puppet] - 10https://gerrit.wikimedia.org/r/302917 [13:42:12] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: remove redundant rules checking [puppet] - 10https://gerrit.wikimedia.org/r/302917 (owner: 10Filippo Giunchedi) [13:42:51] 06Operations, 10Icinga: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2522176 (10Peachey88) [13:43:35] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1008.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [13:43:37] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1009.eqiad.wmnet (tags: 
['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [13:43:40] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1010.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [13:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:42] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1011.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [13:43:44] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp1012.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=parsoid', 'service=parsoid']) [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:40] akosiaris: looks like ganeti1004 needs its salt key deleted/accepted, safe to do so? [13:48:34] godog: er.. yes and no... why did that happen ? [13:48:43] (03PS1) 10Alexandros Kosiaris: apt::repository: Rename the trusted parameter [puppet] - 10https://gerrit.wikimedia.org/r/302919 [13:49:04] akosiaris: not sure why, I saw the alert in icinga, 1.5d old [13:50:19] ganeti1004 was fine... what ... [13:51:54] !log delete+accept ganeti1004 salt minion key [13:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:43] godog: weird... let's keep an eye on that [13:53:02] RECOVERY - salt-minion processes on ganeti1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:34] akosiaris: ack, thanks for taking a look! 
[13:55:02] (03CR) 10Alexandros Kosiaris: [C: 032] apt::repository: Rename the trusted parameter [puppet] - 10https://gerrit.wikimedia.org/r/302919 (owner: 10Alexandros Kosiaris) [13:55:44] (03PS2) 10Alexandros Kosiaris: apt::repository: Rename the trusted parameter [puppet] - 10https://gerrit.wikimedia.org/r/302919 [13:55:46] (03PS2) 10Alexandros Kosiaris: realm: vary $main_address generation on puppetmaster version [puppet] - 10https://gerrit.wikimedia.org/r/302909 [13:57:29] (03PS1) 10Alexandros Kosiaris: Pass --trusted_node_data to the compilation process [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/302920 [13:57:56] (03CR) 10jenkins-bot: [V: 04-1] Pass --trusted_node_data to the compilation process [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/302920 (owner: 10Alexandros Kosiaris) [13:58:57] (03PS3) 10Alexandros Kosiaris: apt::repository: Rename the trusted parameter [puppet] - 10https://gerrit.wikimedia.org/r/302919 [13:59:04] (03CR) 10Alexandros Kosiaris: [V: 032] apt::repository: Rename the trusted parameter [puppet] - 10https://gerrit.wikimedia.org/r/302919 (owner: 10Alexandros Kosiaris) [14:01:05] (03PS2) 10Alexandros Kosiaris: Pass --trusted_node_data to the compilation process [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/302920 [14:02:06] (03CR) 10Alexandros Kosiaris: [C: 032] Pass --trusted_node_data to the compilation process [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/302920 (owner: 10Alexandros Kosiaris) [14:05:01] (03PS5) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [14:05:45] (03PS6) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [14:07:42] (03PS3) 10Alexandros Kosiaris: realm: vary $main_address generation on puppetmaster version [puppet] - 10https://gerrit.wikimedia.org/r/302909 
[14:09:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] realm: vary $main_address generation on puppetmaster version [puppet] - 10https://gerrit.wikimedia.org/r/302909 (owner: 10Alexandros Kosiaris) [14:11:10] !log manually config s1 dbs to scrape on prometheus2001 as a test [14:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:36] (03CR) 10Ottomata: [C: 032] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:22:35] (03CR) 10Chad: "We already removed the option entirely, this just was the leftover in hiera." [puppet] - 10https://gerrit.wikimedia.org/r/302462 (owner: 10Chad) [14:25:03] (03CR) 10Paladox: "Because labs will hit the rate limit" [puppet] - 10https://gerrit.wikimedia.org/r/302462 (owner: 10Chad) [14:25:20] (03CR) 10Paladox: "Notice: /Stage[main]/Gerrit::Proxy/Letsencrypt::Cert::Integrated[gerrit]/Exec[acme-setup-acme-gerrit]/returns: ValueError: Error signing c" [puppet] - 10https://gerrit.wikimedia.org/r/302462 (owner: 10Chad) [14:26:09] (03PS1) 10Alexandros Kosiaris: rhodium: pool rhodium with a loadfactor of 1 [puppet] - 10https://gerrit.wikimedia.org/r/302922 [14:26:11] (03PS1) 10Alexandros Kosiaris: rhodium: Ramp up to the same load as strontium [puppet] - 10https://gerrit.wikimedia.org/r/302923 [14:26:13] (03PS1) 10Alexandros Kosiaris: strontium: depool, palladium: lower the load to 1 [puppet] - 10https://gerrit.wikimedia.org/r/302924 [14:26:15] (03CR) 10Thcipriani: "couple of inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:26:41] (03CR) 10Paladox: [C: 031] "Actually you can remove this, but ive re added support here https://gerrit.wikimedia.org/r/#/c/302852/" [puppet] - 10https://gerrit.wikimedia.org/r/302462 (owner: 10Chad) [14:26:46] (03PS1) 10Elukey: Renamed the analytics deploy keyholder ssh 
keypair files [labs/private] - 10https://gerrit.wikimedia.org/r/302925 [14:26:53] (03CR) 10Chad: [C: 04-1] "Why? Managing certificates is a way bigger pain than letting letsencrypt do it. And in labs: you would never have certs that aren't self-s" [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:27:10] (03CR) 10Elukey: [C: 032 V: 032] Renamed the analytics deploy keyholder ssh keypair files [labs/private] - 10https://gerrit.wikimedia.org/r/302925 (owner: 10Elukey) [14:27:38] (03CR) 10Paladox: "Actually all domains in labs have a ssl certificate. See https://phab-01.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:28:14] jynus: just got to looking at the mariadb::groups changeset, the salt grains look like they cover my needs, was about to go look at the whole patch more closely when.. [14:28:15] (03CR) 10Chad: "Plus: the file is copy+pasted (ew), and the filename is misleading....without ssl? No, it just does the same thing but without the LE cert" [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:28:23] I saw it's already merged :-D [14:28:35] :-) [14:28:39] sorry about that [14:28:45] it was more about the prometheus groups [14:28:49] (03CR) 10Chad: "That's assuming it's behind the generic labs proxy, which it might not be." [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:28:57] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2522301 (10Cmjohnson) @Jgreen cabled to por11 pfw2...do you need anything more? [14:29:14] (03CR) 10Paladox: "But the domain is." 
[puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:29:17] and we have been discussing it 3 of us for 2 days [14:29:22] (03CR) 10Elukey: Move the Analytics Refinery role to scap3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:29:23] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup new fundraising queue servers - https://phabricator.wikimedia.org/T136882#2522304 (10Cmjohnson) 05Open>03Resolved The data center portion of this has been completed. [14:29:30] but it is still WIP [14:29:40] not definitive at all [14:29:51] however, depending on how puppet ends up [14:30:02] (the upcoming upgrade) [14:30:03] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2522309 (10Cmjohnson) Had a new system board installed but issue persists. Need more time to troubleshoot. [14:30:18] (03CR) 10Chad: "Yes, but we don't have access to that cert unless we're behind the generic labs proxy, which we might not be, like I said." [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:30:22] we may want to do it differently [14:30:25] 06Operations, 10hardware-requests: decommission dickson - https://phabricator.wikimedia.org/T120752#2522310 (10Cmjohnson) p:05Triage>03Low [14:30:36] dare I ask what the other options are? 
[14:30:42] I would wait to see what puppetdb brings [14:31:10] ahhh hrm [14:31:16] and see if we should integrate it with salt or with whatever comes from that [14:31:19] interesting, we shall see [14:31:28] salt in any case will be around [14:31:32] yep [14:31:45] well having something that uses puppetdb as well is a great idea, in theory at any rate [14:31:49] but with that in progress [14:32:02] and the new mysql monitoring [14:32:04] this coming week I should be able to get back to the automated check script [14:32:11] as I mentioned earlier [14:32:18] maybe in 2 months things look completely differently [14:32:22] yep [14:32:31] so I would work on the non-orchestration parts of it [14:32:44] or maybe abstract the information gathering [14:32:51] if for any reason it changes [14:33:26] well I just want to integrate the pieces so there's something there that works, and can be extended [14:33:33] we will definitely need remote-execution [14:33:39] oh yes [14:33:59] although I would try to use mysql protocol for the mysql-specific bits [14:34:06] as it is more reliable [14:34:36] sure [14:34:49] the other things is- we have a script, that is ok, but it is only a part [14:35:07] what do we do with it? a webpage with reports? [14:35:16] there's crunching the output, showing irregularities [14:35:16] maybe storing it on a database? [14:35:18] email reports [14:35:30] storing it somewhere, possible alert generation [14:35:32] and possibly [14:35:33] I am thinking we may need to cache results [14:35:38] actions taken [14:35:39] and not only generated them [14:35:45] all much further down the road of course [14:35:56] because there is like 500k tables in tital? 
so one thing about it is that we can do a group of tables at a time [14:36:31] I am not thinking that far, but if we have to wait on puppet, maybe we can do the other parts already [14:36:47] compare to see what's drifted/broken/missing etc [14:36:51] report that out [14:37:08] because I dunno about you but I will never read a report about 500k tables [14:37:12] yes [14:37:20] as in, of course not [14:37:25] :-D [14:37:36] one thing I thought is [14:37:47] (03CR) 10Paladox: "Yep, but at least we have an option if we are behind labs so we don't hit the limit." [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [14:37:51] as hopefully, the graphing part of tendril will go away soon [14:37:59] it will? [14:38:05] what will replace it? [14:38:09] * apergos likes the graphs [14:38:17] apergos, prometheus, fingers crossed [14:38:20] ahhhhh [14:38:41] oh, if everything goes as planned, you will have even better graphs [14:38:41] as soon as there's something to look at, feel free to toss me a linky [14:39:09] so, one thing I thought, as I was saying, is to cache that on a database [14:39:22] so that we have always something to look at [14:39:27] and work with that cache [14:39:43] otherwise querying hundreds of servers will become unbearable [14:39:50] which would make queries against the data so much easier [14:39:53] and better than having X text files [14:39:55] exactly [14:39:59] as opposed to working with huge strings of text [14:40:01] yeah [14:40:02] and that was tendril in a nutshell [14:40:05] oh dear [14:40:13] sure looked nice from the user end though :-D [14:40:31] so you know me, I'm gonna want the text files AND the db [14:40:35] so, taking away the graph thing, but improve it with that [14:40:42] so that if the db goes all to hell we can reload from the text files [14:40:51] but only use the text files for that [14:41:08] why do you think we will not have HA for tendril? 
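The "cache the check results in a database" idea discussed above can be sketched minimally: one row per (host, db, table) check, so reports query the cache instead of re-scanning hundreds of servers, with the text files kept as the reload-of-last-resort the speakers describe. The schema and status values here are hypothetical:

```python
import sqlite3

def store_results(conn, results):
    """Upsert (host, db, table, status) tuples into the cache.

    Re-running a check for the same table replaces the old row, so
    the cache always reflects the latest pass over each group of
    tables."""
    conn.execute("""CREATE TABLE IF NOT EXISTS table_checks (
        host TEXT, db TEXT, tbl TEXT, status TEXT,
        checked_at TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (host, db, tbl))""")
    conn.executemany(
        "INSERT OR REPLACE INTO table_checks (host, db, tbl, status) "
        "VALUES (?, ?, ?, ?)", results)
    conn.commit()

def drifted(conn):
    """Report only irregularities, never all ~500k rows."""
    return conn.execute(
        "SELECT host, db, tbl FROM table_checks "
        "WHERE status != 'ok'").fetchall()
```

Querying the cache makes the "show me what's drifted/broken/missing" report a cheap SQL filter instead of a pass over huge strings of text.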
I don't know if we'll have HA that always works [14:41:31] I mean eventually I bet all the bugs will be out [14:41:34] but until then.... [14:41:36] m*ark suggested it before, when I told him it was crashing many times [14:41:39] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2522379 (10Cmjohnson) @akosiaris ganeti1002 disks have been replaced. All yours [14:41:51] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2522381 (10Cmjohnson) [14:41:54] (03CR) 10Thcipriani: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:41:58] so just call it hedging our bets [14:42:00] (03CR) 10Alexandros Kosiaris: [C: 032] rhodium: pool rhodium with a loadfactor of 1 [puppet] - 10https://gerrit.wikimedia.org/r/302922 (owner: 10Alexandros Kosiaris) [14:42:03] I mean, tendril right now crashes every day and all graphs come from it [14:42:13] seriously? every day? [14:42:19] almost [14:42:28] thcipriani: thanks a lot for the help [14:42:28] we have terabytes of data there [14:42:42] memory cannot handle it [14:42:43] elukey: np, happy to help :) [14:42:45] ugh [14:42:58] you know what I'm gonna ask... is there a ticket? :-D [14:43:02] ha ha [14:43:17] akosiaris: I42aa26ce0c267146357edd49d11b8efbc2b4d447 seems to have broken tool labs [14:43:18] the ticket is the one you subscribed to [14:43:25] all right then [14:43:25] if you mean for tendril [14:43:26] akosiaris: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: facts is not a hash or array when accessing it with ipaddress_eth0 at /etc/puppet/manifests/realm.pp:9 on node tools-proxy-02.tools.eqiad.wmflabs [14:43:29] yep [14:43:38] there are a couple [14:43:47] the one for memory issues, HA, all that [14:43:49] valhallasw`cloud: damn... 
[14:43:56] but set as low because this is all caused by the graphing [14:44:09] and that is exactly what godog and me are working on now [14:44:12] valhallasw`cloud: what puppetmaster version is that on.. [14:44:13] gotcha [14:44:36] so, I would say, I know it is a priority [14:45:00] but it has some blockers, let's wait a month, think about options and let's reevaluate [14:45:05] right [14:45:14] (03PS8) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [14:45:18] it has been open for a year [14:45:19] well for me it's not a rush, I'd like to see it happen but it's your ballgame as the dba [14:45:36] I'll support as I can [14:45:39] it can wait a month now that things are really improving [14:46:01] if I had a wish [14:46:15] it would be to convert your script on an api [14:46:31] (03CR) 10jenkins-bot: [V: 04-1] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:46:33] or separate the command line from the logic [14:46:37] write one paragraph on that ticket about your wish [14:46:39] so it can be reused [14:46:44] remember I'm working on it next week again [14:46:45] apergos, fair enough! 
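The tools-labs failure pasted above ("facts is not a hash or array when accessing it with ipaddress_eth0 at /etc/puppet/manifests/realm.pp:9") points at indexing the `$facts` hash on a master that does not provide it, which older puppetmasters without trusted_node_data do not. A guard in that spirit might look like the following; this is a hypothetical sketch, not the actual 302929 diff, and `is_hash`/`has_key` come from puppetlabs-stdlib:

```puppet
# Illustrative guard for realm.pp: only index $facts where the master
# actually populates it; otherwise fall back to the legacy top-scope
# fact that all masters expose.
if is_hash($facts) and has_key($facts, 'ipaddress_eth0') {
    $main_address = $facts['ipaddress_eth0']
} else {
    $main_address = $::ipaddress_eth0
}
```

Either branch yields the same value on a new master, so the guard is safe to roll out before all masters are upgraded.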
(03PS10) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [14:46:51] sweet [14:46:53] (03PS11) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [14:48:50] (03PS1) 10Alexandros Kosiaris: realm: Also check if $facts exists [puppet] - 10https://gerrit.wikimedia.org/r/302929 [14:49:16] (03PS12) 10Paladox: Make letsencrypt optional in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/302852 [14:49:22] valhallasw`cloud: this should fix it: https://gerrit.wikimedia.org/r/#/c/302929/1 [14:50:16] (03PS9) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [14:52:05] (03CR) 10Alexandros Kosiaris: [C: 032] realm: Also check if $facts exists [puppet] - 10https://gerrit.wikimedia.org/r/302929 (owner: 10Alexandros Kosiaris) [14:52:16] (03PS2) 10Alexandros Kosiaris: realm: Also check if $facts exists [puppet] - 10https://gerrit.wikimedia.org/r/302929 [14:52:21] (03CR) 10Alexandros Kosiaris: [V: 032] realm: Also check if $facts exists [puppet] - 10https://gerrit.wikimedia.org/r/302929 (owner: 10Alexandros Kosiaris) [14:52:32] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [14:52:55] (03CR) 10Thcipriani: [C: 031] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:53:09] (03CR) 10Ottomata: [C: 032] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [14:53:54] \o/ [14:54:07] valhallasw`cloud: ok merged. 
You should good to go [14:54:13] be good to go* [14:55:15] got to restart puppetmaster on palladium, sorry for the spam that is going to follow guys [14:55:31] er, people I mean [14:56:33] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:56:37] 06Operations, 10Icinga: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2522455 (10Volans) My proposed solution is: Monitored hosts: - allow to run through NRPE a specific command that gathers all the informations needed for the task for the specific type of RAID.... [15:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T1500). [15:00:04] jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:34] akosiaris: thanks, I'll keep an eye on shinken [15:01:12] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: puppet fail [15:01:13] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: puppet fail [15:01:23] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: puppet fail [15:01:38] I can SWAT today. jan_drewniak ping! 
[15:01:43] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail [15:01:50] thcipriani: o/ [15:01:53] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: puppet fail [15:01:53] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 10 failures [15:01:53] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: puppet fail [15:01:53] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 18 failures [15:01:53] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [15:02:02] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: Puppet has 11 failures [15:02:03] those ^ are expected [15:02:04] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 9 failures [15:02:08] thcipriani: merging the change, going to move the refinery now to scap3 [15:02:10] akosiaris: fine to SWAT now? Or should I wait until puppetmaster restart is finished? [15:02:12] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [15:02:12] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [15:02:22] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 24 failures [15:02:22] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 9 failures [15:02:25] argh didn't see it [15:02:32] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [15:02:33] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 37 failures [15:02:33] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Puppet has 9 failures [15:02:37] thcipriani: oh it's done already. that's just the wake. 
you can proceed [15:02:43] ack, thanks [15:03:02] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 11 failures [15:03:03] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: puppet fail [15:03:03] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [15:03:12] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: puppet fail [15:03:13] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 5 failures [15:03:14] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 11 failures [15:03:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302921 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:03:23] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 2 failures [15:03:39] (03PS10) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [15:03:44] (03Merged) 10jenkins-bot: Bumping portals to master. Updating stats on all portals. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302921 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:04:53] (03CR) 10Elukey: [C: 032] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [15:05:13] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is OK: Files ownership is ok. [15:05:13] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is OK: Files ownership is ok. 
[15:07:02] 06Operations, 10hardware-requests: reclaim hooft to spares - https://phabricator.wikimedia.org/T131560#2522493 (10RobH) [15:07:28] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:302921|Bumping portals to master.]] (duration: 00m 36s) [15:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:06] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:302921|Bumping portals to master.]] (duration: 00m 37s) [15:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:12] ^ jan_drewniak check please [15:08:58] (03PS1) 10Joal: Remove pagecounts-[raw|all-sites] related code [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) [15:09:12] thcipriani: looks good, thanks! [15:09:22] jan_drewniak: thanks for checking. [15:09:25] ottomata: when you have a minute --^ [15:12:21] 06Operations, 10Icinga: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2522106 (10jcrespo) Suggestion: have a "blacklist" and/or a "whitelist"- most of our servers will want an "as automated as possible" hw replacement, but I am thinking of some edge cases that may... [15:13:05] elukey: I'm around analytics/refinery deployment moral support and/or deployment stalking :) [15:13:16] thcipriani: I was about to update you :) [15:13:32] so all good on tin, I can see the keys under /etc/keyholder.d [15:13:43] so now I'd need to run keyholder add analytics_deploy [15:14:36] or maybe arm, but it says all keys on the wiki doc [15:14:59] 'add' doesn't seem right [15:15:07] probably I'd need to arm the keys [15:15:12] I'd probably do add, otherwise you have to update all the keys... [15:15:13] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [15:15:27] hello keyholder! 
[15:15:37] thcipriani: --^ [15:15:45] arm just adds all non-public keys under /etc/keyholder.d [15:16:19] add just adds a specific key, so you could do: keyholder add /etc/keyholder.d/analytics_deploy [15:16:31] all right trying [15:16:53] 06Operations, 10Icinga: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2522537 (10Volans) @jcrespo of course, the automatic message I had in mind will ask to always check with the service "owner" or with anyone in Ops before actually replacing the disk. But it could... [15:17:13] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [15:17:20] thcipriani: done :) [15:17:45] elukey: nice! you should test the key pre-deploy, make sure keyholder permissions are happy. [15:19:26] (03CR) 10Ottomata: "Just to write this down somewhere. After merge we need to do some manual steps:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) (owner: 10Joal) [15:19:30] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2522544 (10ema) Some more observations on the stalling issue: - v3-plus "stalls" for a while (~3s in my tests) on miss - v4 stalls on first hit and doesn't stall on subsequent hits [15:20:04] thcipriani: I updated https://wikitech.wikimedia.org/wiki/Keyholder too [15:20:46] (03PS1) 10Alexandros Kosiaris: nova: Move away from the deprecated hash mutations [puppet] - 10https://gerrit.wikimedia.org/r/302934 [15:21:03] ahh, nice, thanks forgot about that step :) [15:21:45] thcipriani: this is a dumb question but what test would you do pre-deploy? A simple ssh conn? [15:22:25] yup. After you've run puppet on the target boxes to create the analytics users/keys do: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l analytics stat1002.eqiad.wmnet from tin. 
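The keyholder exchange above (arm vs. add, then the pre-deploy connectivity check) can be sketched as a short shell session. Key path, socket path, user, and hostname are taken from the conversation; the keyholder and ssh commands themselves are left commented out since they need the real service on tin:

```shell
# Keyholder holds deploy keys in a shared ssh-agent; deployers reach it
# through a proxy socket instead of loading the keys themselves.
KEY=/etc/keyholder.d/analytics_deploy
SOCK=/run/keyholder/proxy.sock

# Load one specific key, or every key configured under /etc/keyholder.d:
#   keyholder add "$KEY"
#   keyholder arm

# Pre-deploy test from tin: ssh to a target box via the proxy socket.
echo "SSH_AUTH_SOCK=$SOCK ssh -l analytics stat1002.eqiad.wmnet"
```

If the ssh test fails with "Agent admitted failure to sign", the usual fix (as discussed later in the log) is restarting keyholder-proxy so it re-reads the permissions in /etc/keyholder-auth.d.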
[15:22:49] super [15:23:29] *might* need to restart keyholder_proxy, I can't remember. If it gives you something about agent refusing to sign, then definitely restart keyholder proxy :) [15:23:53] E: Version '3.2.0-1' for 'scap' was not found [15:24:08] but IIRC the pkg should be on carbon [15:24:10] ah crap. [15:24:53] not anymore. I had an update planned for today to 3.2.2-1 so that's the package that's on carbon. https://gerrit.wikimedia.org/r/#/c/302744/ [15:25:04] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [15:25:13] ah poor mira! [15:25:20] going to fix it [15:25:27] but I haven't bumped it in puppet yet. I scheduled it for puppet swat in half an hour :( [15:25:32] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:22] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:26:29] thcipriani: if you have the pkg somewhere I can install it manually [15:26:52] (03PS1) 10Alexandros Kosiaris: elasticsearch: Remove cruft notice [puppet] - 10https://gerrit.wikimedia.org/r/302935 [15:27:13] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:27:13] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:27:22] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:27:23] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:27:32] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:27:40] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC is happy at https://puppet-compiler.wmflabs.org/3608/." 
[puppet] - 10https://gerrit.wikimedia.org/r/302934 (owner: 10Alexandros Kosiaris) [15:27:44] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:27:53] ok mira should be fine [15:27:53] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:27:53] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:27:53] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:27:55] 3.2.0-1? Might have to rebuild it. I can do that though... [15:28:02] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:28:03] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:28:13] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:28:13] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:28:13] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:14] thcipriani: ah okok sorry didn't see the version change [15:28:23] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:33] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:28:39] (03PS3) 10Chad: Adding my new SSH key to production [puppet] - 10https://gerrit.wikimedia.org/r/302277 [15:28:42] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:44] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently 
enabled, last run 9 seconds ago with 0 failures [15:29:03] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:29:03] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:04] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:29:05] thcipriani: I can wait until you update the scap package, not in a hurry [15:29:13] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [15:29:13] maybe I can ping you in a bit? [15:29:22] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:32] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:33] (03CR) 10Chad: "s/PS2/PS3/, didn't realize it'd been rebased. Anyway, comment stands :)" [puppet] - 10https://gerrit.wikimedia.org/r/302277 (owner: 10Chad) [15:29:52] elukey: yeah, I'm around all day. [15:29:54] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:06] well, for the next 8ish hours anyway. [15:30:16] (03CR) 10Andrew Bogott: [C: 031] ":( I liked the old way better" [puppet] - 10https://gerrit.wikimedia.org/r/302934 (owner: 10Alexandros Kosiaris) [15:30:31] thcipriani: I am EU based but will be around for the next couple of hours [15:31:31] elukey: ack. scap version bump will fix, sorry about that :( [15:31:51] thcipriani: no problem, you have been super helpful! [15:37:38] (03CR) 10Alexandros Kosiaris: "same here. puppet sucks as a declarative language. thanks!" 
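The "Version '3.2.0-1' for 'scap' was not found" failure above is a version-pin mismatch: puppet still requests 3.2.0-1 while the repo on carbon already carries 3.2.2-1, so apt refuses the install until the pin is bumped. A minimal sketch of spotting which side is ahead, using the two versions from the conversation (`sort -V` orders these Debian-style version strings correctly; on a real host you would instead run `apt-cache policy scap`):

```shell
# Version pinned in puppet vs. version actually published in the repo:
pinned=3.2.0-1
published=3.2.2-1

# Highest of the two by version sort; if it isn't the pinned one,
# the pin is stale and the install will fail.
newest=$(printf '%s\n%s\n' "$pinned" "$published" | sort -V | tail -n1)
echo "repo is ahead of the pin: $newest"
```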
[puppet] - 10https://gerrit.wikimedia.org/r/302934 (owner: 10Alexandros Kosiaris) [15:37:40] (03CR) 10Alexandros Kosiaris: [C: 032] nova: Move away from the deprecated hash mutations [puppet] - 10https://gerrit.wikimedia.org/r/302934 (owner: 10Alexandros Kosiaris) [15:41:36] (03CR) 10Nuria: [C: 031] "Looks good, I added @ottomata's comments to ticket so we do not forget." [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) (owner: 10Joal) [15:43:19] (03PS2) 10Thcipriani: Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 [15:46:12] (03PS2) 10Joal: Remove pagecounts-[raw|all-sites] related code [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) [15:46:14] 06Operations, 10Phabricator: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2522579 (10Volans) [15:46:19] (03CR) 10Joal: "change done." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) (owner: 10Joal) [15:47:08] 06Operations: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2522596 (10Volans) [15:49:04] (03CR) 10Alex Monk: "So it's all duplicated across each of the labtest host files? 
It seems messy - are you sure there isn't some neater way, maybe involving r" [puppet] - 10https://gerrit.wikimedia.org/r/302750 (owner: 10Alexandros Kosiaris) [15:51:31] 06Operations, 06Discovery, 10Traffic, 03Discovery-Search-Sprint: Setup LVS for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2522624 (10Gehel) [15:53:12] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2522639 (10Cmjohnson) PowerEdge Expandable RAID Controller BIOS Copyright(c) 2014 LSI Corporation Press to Run Configuration Utility HA -0 (Bus 1 Dev 0) PERC H730 Mini FW packag... [15:55:52] (03CR) 10Rush: "I"m not sure what best way to do this if you are cleaning up but I don't think labtest is temporary. AFAIU it is a long lived testing env" [puppet] - 10https://gerrit.wikimedia.org/r/302750 (owner: 10Alexandros Kosiaris) [15:58:06] (03PS1) 10Reedy: Update WikimediaMessages LicenseTexts entry point in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302940 [15:58:58] akosiaris, mobrovac how many nodes are done now? [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T1600). [16:00:04] urandom, thcipriani, and tgr: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:12] o/ [16:00:12] o/ [16:00:15] * urandom is present [16:00:54] (03PS3) 10Chad: Phab: Properly remove public_task_dump.py if $dump = false [puppet] - 10https://gerrit.wikimedia.org/r/302271 [16:01:54] (03PS3) 10Eevans: Increase permissions validity on RESTBase cluster [puppet] - 10https://gerrit.wikimedia.org/r/301878 (https://phabricator.wikimedia.org/T140869) [16:02:06] (03PS8) 10Eevans: Configurable `vm.dirty_background_bytes` parameter [puppet] - 10https://gerrit.wikimedia.org/r/301425 (https://phabricator.wikimedia.org/T140825) [16:02:18] (03CR) 10jenkins-bot: [V: 04-1] Phab: Properly remove public_task_dump.py if $dump = false [puppet] - 10https://gerrit.wikimedia.org/r/302271 (owner: 10Chad) [16:02:52] (03PS4) 10Chad: Phab: Properly remove public_task_dump.py if $dump = false [puppet] - 10https://gerrit.wikimedia.org/r/302271 [16:05:09] opsen around for puppet swat? one of the updates I had is blocking analytics :( [16:07:07] thcipriani: so _joe_ is on holiday, and godog is in a meeting i think [16:07:16] blerg. [16:07:35] moritzm: it may come down to you :) [16:08:03] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2522677 (10Cmjohnson) Requested a new RAID Controller. Found that we're not the only one w/this problem https://arstechnica.com/civis/viewtopic.php?f=21&t=1316257 [16:08:13] yeah sorry for bailing out, I'm available in 20 min [16:10:41] subbu: the first 7 are running on jessie / node 4 for now [16:11:23] how about wtp1008 - wtp1012 ? i thought they were being reimaged today. [16:11:46] urandom: o/ have you reached an agreement with joe about vm.dirty_background_bytes ? IIRC there was a -1 from him [16:12:08] I can merge in a bit the permission change [16:12:15] (03CR) 10Chad: [C: 04-1] "If you don't have a cert somewhere on disk to pass, this won't work. Where do you have a cert? 
The *.wmflabs.org cert isn't available on o" [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [16:12:20] elukey: there was a -1 pending some feedback (pretty straightforward stuff), and i addressed that feedback [16:12:33] elukey: and then he went on holiday :( [16:12:36] ah okok, so I am in a meeting atm but I'll review it shortly [16:12:48] (at least try to) [16:12:49] so, implicitly I think we're OK [16:13:01] elukey: awesome, thanks [16:13:47] (03CR) 10Paladox: "@Chad or we can get it so we use letsencrypt test certificates in labs." [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [16:14:02] (03CR) 10Elukey: [C: 032] Increase permissions validity on RESTBase cluster [puppet] - 10https://gerrit.wikimedia.org/r/301878 (https://phabricator.wikimedia.org/T140869) (owner: 10Eevans) [16:14:37] urandom: --^ [16:14:42] elukey: sweet, thanks! [16:15:03] (03PS2) 10Reedy: Update WikimediaMessages LicenseTexts entry point in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302940 (https://phabricator.wikimedia.org/T139800) [16:16:44] mobrovac, how about wtp1008 - wtp1012 ? looking at ganglia, I saw that they were down for a few hours today. [16:17:52] subbu: oh right, yes, those are on jessie now as well [16:18:34] ok. so, 12 done, 12 more to go. [16:19:22] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: puppet fail [16:20:45] urandom: any issue if we delay https://gerrit.wikimedia.org/r/#/c/301425 until godog is around? I've read the comments and it looks good to me, but I'd prefer to wait for him [16:20:47] 06Operations, 10Ops-Access-Requests: Add analytics team members to group aqs-deploy to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522728 (10Nuria) [16:21:00] elukey: nope, no issue [16:21:08] super thanks :) [16:21:13] thcipriani: your turn :)( [16:21:20] elukey: thanks for your help! 
[16:21:36] (03PS3) 10Elukey: Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 (owner: 10Thcipriani) [16:22:14] \o/ thanks elukey [16:23:10] so the change looks super easy, it should be fine.. have you guys already deployed the pkg and this is for consistency (and for me :) or is it going to update packages all around? [16:23:15] thcipriani: --^ [16:23:56] elukey: this update will update the package on the next puppet run. The package has been updated nowhere. [16:24:23] ...except on carbon, I suppose. [16:24:32] (03PS2) 10Alex Monk: labnet: Merge site_address and network_public_ip in novaconfig [puppet] - 10https://gerrit.wikimedia.org/r/302835 [16:25:36] thcipriani: all right, again it lgtm but I'd prefer to wait for godog for final confirmation. If you don't mind I can review the beta patches first [16:25:50] elukey: sure that's fine. [16:26:08] the beta patches are all beta-only and have been cherry-picked on the puppetmaster there. [16:27:06] 06Operations, 10Ops-Access-Requests, 10Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522756 (10Nuria) [16:27:10] thcipriani: super [16:27:17] the scap.cfg one will change the cfg file in production, but it won't impact configuration that affects production in any way. [16:28:17] this one: https://gerrit.wikimedia.org/r/#/c/300457/ won't have any impact on any production machines. 
[16:29:16] (03PS1) 10Reedy: 11 more to extension.json for wmf/1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302942 (https://phabricator.wikimedia.org/T139800) [16:29:54] thcipriani: for some reason I can't rebase or merge for https://gerrit.wikimedia.org/r/#/c/300458 [16:29:55] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2522767 (10HJiang-WMF) Hi, I tried the chmod 600 * for a 2nd time and was able to log in to myusername@bast1001 without sudo, but whe... [16:30:31] elukey: hmm, sorry about that, lemme manually rebase. [16:31:22] sure! [16:32:08] (03PS2) 10Thcipriani: Beta: Add logstash host to scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/300458 [16:33:07] ^ elukey should be rebased now [16:33:51] super! Last thing - the -1 is old right? [16:34:15] because there is WIP: Depends on https://phabricator.wikimedia.org/D248 [16:34:20] (03CR) 10Thcipriani: [C: 031] "can merge now" [puppet] - 10https://gerrit.wikimedia.org/r/300458 (owner: 10Thcipriani) [16:34:23] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2522784 (10AlexMonk-WMF) When you say "when I tried to ssh from there in to any specific stat machines (e.g. stat1001)", did you just... [16:35:23] elukey: yup, old. that is deployed in beta, and it actually wouldn't have impacted beta anyway, just an extra key in the config. 
[16:35:58] (03CR) 10Filippo Giunchedi: [C: 031] Configurable `vm.dirty_background_bytes` parameter [puppet] - 10https://gerrit.wikimedia.org/r/301425 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:36:02] elukey: yup LGTM ^ [16:36:13] (03PS1) 10Reedy: 2 more to extension.json for wmf/1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302943 (https://phabricator.wikimedia.org/T139800) [16:36:33] hah we've a full lid for puppet swat [16:36:56] s/lid/roster/ [16:37:54] (03CR) 10Elukey: [C: 032] Beta: Add logstash host to scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/300458 (owner: 10Thcipriani) [16:38:19] elukey: you're taking some puppet swat reviews? [16:38:34] yeah just did one for Eric and one for Tyler [16:38:43] the biggest one were waiting for you [16:38:43] :) [16:39:10] godog: https://gerrit.wikimedia.org/r/#/c/301425 is the one remaining for me [16:39:22] yeah, that one has the priority [16:39:24] it looks good to me [16:39:26] it's the sysctl parameter and trickle_fsync disable [16:39:43] urandom: yup, that's already applied in production? [16:40:01] _joe_ had some feedback (indicating it was OK otherwise), i addressed that, but he went on holiday so i didn't hear anything back [16:40:13] godog: no, not https://gerrit.wikimedia.org/r/#/c/301425 [16:40:57] ok, I checked and sysctl::parameter just drops the file, so on puppet run nothing will happen [16:41:00] just FYI [16:41:15] you mean it won't be applied on the machine, right? [16:41:27] correct [16:41:28] I'm expecting/OK with that [16:41:29] yeah [16:41:38] i've already set it on most machines anyway [16:42:09] urandom: ok! could you rebase it? there's a path conflict according to gerrit [16:42:34] godog: yup, one sec [16:42:36] godog: I wanted to have your opnion on this one too - https://gerrit.wikimedia.org/r/#/c/302744/2 [16:43:07] because it will trigger pkg upgrades.. is it the correct procedure or should we deploy it manually on some hosts first etc.. 
[16:43:12] haven't done it before in this way [16:43:43] elukey: scap gets first installed/tested in beta generally so the upgrade is safe [16:44:23] (03PS9) 10Eevans: Configurable `vm.dirty_background_bytes` parameter [puppet] - 10https://gerrit.wikimedia.org/r/301425 (https://phabricator.wikimedia.org/T140825) [16:44:24] thcipriani: empty file for mediawiki-api-canaries on modules/beta/files/dsh/group/mediawiki-api-canaries ? [16:44:28] godog: ^^ [16:45:00] godog: yeah, scap expects a file there, but it doesn't have to have anything. It gets combined with the appserver canaries. [16:45:45] (03CR) 10Filippo Giunchedi: [C: 032] Configurable `vm.dirty_background_bytes` parameter [puppet] - 10https://gerrit.wikimedia.org/r/301425 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:46:03] godog: thanks man! [16:46:06] waiting for jenkins +2s is like watching water boling [16:46:09] boiling even [16:46:14] urandom: np! [16:46:25] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:46:37] thcipriani: ack, thanks [16:46:43] (03PS2) 10Filippo Giunchedi: Beta: Scap canary deploy dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/300457 (owner: 10Thcipriani) [16:46:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: Scap canary deploy dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/300457 (owner: 10Thcipriani) [16:48:12] Hi, puppet is failing in -releng. 
[16:48:15] (03CR) 10Filippo Giunchedi: [C: 032] Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 (owner: 10Thcipriani) [16:48:23] (03PS4) 10Filippo Giunchedi: Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 (owner: 10Thcipriani) [16:48:25] (03CR) 10Filippo Giunchedi: [V: 032] Bump Scap to v.3.2.2-1 [puppet] - 10https://gerrit.wikimedia.org/r/302744 (owner: 10Thcipriani) [16:48:26] godog: huh, this generating an error [16:48:33] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: "false" is not a boolean. It looks to be a String at /etc/puppet/modules/cassandra/manifests/init.pp:330 on node restbase1007.eqiad.wmnet [16:48:57] urandom: sigh [16:49:02] oh my [16:49:05] :( [16:49:22] thcipriani: I wanted to ask the same question about the empty file, otherwise the patch looks fine [16:49:52] :) [16:50:59] urandom: taking a look [16:52:11] ah yeah other bools e.g. start_rpc I don't think we're validating [16:55:55] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:55:56] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: puppet fail [16:56:28] (03PS1) 10Filippo Giunchedi: cassandra: don't run validate_bool on trickle_fsync [puppet] - 10https://gerrit.wikimedia.org/r/302948 [16:57:16] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: puppet fail [16:57:26] godog: so the problem is limited to validation? [16:57:35] godog: shouldn't this be caught by CI? [16:57:55] 06Operations, 06Discovery, 10Traffic, 03Discovery-Search-Sprint: Setup LVS for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2522897 (10Gehel) Reading [[ https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service | documentation ]], it seems that we only... 
[16:57:59] not our current CI, possibly the compiler though [16:58:35] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: puppet fail [16:59:36] PROBLEM - MariaDB Slave Lag: s2 on db1036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1043.69 seconds [16:59:53] s2? why [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T1700). Please do the needful. [17:00:13] db1036, why is that familiar? [17:00:27] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 1 failures [17:00:39] s2 is not me this time [17:00:47] I only have 4 things going on over there [17:00:53] no deploy for ores today I guess [17:00:56] and they should all be short run queries in serial [17:01:13] * apergos doublechecks [17:01:20] godog: hrmm, i did run it through the compiler (at least around abouts patch 4) [17:01:21] Hi jynus. So, in a few hours, we'll create the database for tcy.wikipedia.org, on s3. [17:01:21] nope [17:01:26] it is the rename user [17:01:32] task [17:01:44] pl wiki this time [17:01:45] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: don't run validate_bool on trickle_fsync [puppet] - 10https://gerrit.wikimedia.org/r/302948 (owner: 10Filippo Giunchedi) [17:01:49] thcipriani: on stat1002.eqiad.wmnet returned [255]: Agent admitted failure to sign using the key. 
[17:01:57] RECOVERY - MariaDB Slave Lag: s2 on db1036 is OK: OK slave_sql_lag Replication lag: 0.71 seconds [17:01:58] I lie, I have a number of things happening on s2 [17:02:02] but they are all short [17:02:17] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: puppet fail [17:02:44] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail [17:03:03] elukey: ah, yeah, so if /etc/keyholder-auth.d/analytics_deploy.yaml looks right on tin, then that means that the keyholder_agent service will need to be restarted so that it can re-read the permissions on disk. [17:03:10] er, [17:03:15] keyholder_proxy [17:03:17] it is ok now [17:03:24] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail [17:03:35] renames grrrr [17:03:40] urandom: it did the right thing this time around [17:03:43] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:03:50] godog: yup, looks good. [17:03:51] apergos, you have things running on the watchlist/recentchanges slave? [17:03:57] sorry, could have been a headache :) keyholder_proxy *shouldn't* ask you to re-arm keyholder, it should just be the permissions piece. [17:03:58] or just s2 in general? [17:03:59] no [17:03:59] godog: Notice: /Stage[main]/Sysctl/Exec[update_sysctl] <-- is this not the param being applied? [17:04:02] I'd better not [17:04:08] if it is dump, it is unrelated [17:04:09] I have them on the vslow/dumps [17:04:17] sorry I was just looking at cluster, not host [17:04:17] 99% convinced it is a rename process [17:04:21] no outage [17:04:38] just watchlists/recentchanges were outdated for a few minutes [17:04:38] thcipriani: so is it safe to just service keyholder-proxy restart? [17:04:55] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: puppet fail [17:04:55] I will bump the priority of the rename task to unbreak now, ok? 
[17:04:57] * thcipriani double-checks [17:05:04] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: puppet fail [17:05:14] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: puppet fail [17:05:15] 06Operations, 06Release-Engineering-Team, 15User-greg, 07Wikimedia-Incident: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2522931 (10greg) Update: We (Faidon, Kevin S, and myself) just had a conversation about this. Notes at https://etherpad.w... [17:05:52] Dereckson, I will be with you in a minute [17:06:01] yeah ubn makes sense [17:06:05] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: puppet fail [17:06:31] elukey: yes, I believe so. To be on the safe side, would be good to be ready to re-enter passwords for the keys in keyholder, but I really doubt that you will be prompted to do that. Should just re-make the proxy socket after re-reading perms. [17:06:44] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:06:46] all right [17:06:54] RECOVERY - puppet last run on cp1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:07:02] urandom: ah, heh indeed [17:07:09] !log restarting keyholder-proxy on tin to let the new analytics key to be picked up [17:07:12] the new ubn is #Wikimedia-Incident [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:18] if you mean the old meaning of UBN [17:07:34] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: puppet fail [17:08:26] thcipriani: much better now! 
scap deploy proceeds but I have a python error [17:08:29] going to check [17:08:34] !log T140825,T140869: Restarting Cassandra, restbase1007-a.eqiad.wmnet [17:08:35] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [17:08:36] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [17:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:22] bblack, thanks [17:09:42] what's the new meaning? I sort of lost track reading the discussion [17:09:43] so I should add #Wikimedia-Incident to https://phabricator.wikimedia.org/T116425, is that right? [17:10:03] I didn't know about that new usage [17:10:23] thcipriani: https://phabricator.wikimedia.org/P3643 - but scap-deploy is showing me progress (git-fat step) [17:10:37] jynus: yes, if it's an incident-level thing, or the old meaning of UBN as we used it [17:10:53] (03PS23) 10Chad: Logstash: Enable log4j provider [puppet] - 10https://gerrit.wikimedia.org/r/302601 (https://phabricator.wikimedia.org/T141324) [17:10:58] there was a discussion about how not every project in phab is part of wikimedia prod stuff, and they might have their own meanings for UBN and other priorities, etc, etc.... [17:11:20] yes, it caused, for the second time user annoyance on one of our more important functionalities (recentchanges) [17:11:28] so UBN still exists but has variable meaning. #Wikimedia-Incident means it's an incident for us in prod that deserves highest priority in WMF Tech. [17:11:31] elukey: hmm, that's an error outputting some log messages. I'm watching with scap deploy-log -v and everything *seems* to be going ok. I will investigate that one. [17:11:45] yeah super weird [17:11:45] elukey: seems like it finished? 
[17:11:48] annoyance == degradation of service [17:11:53] yes I stopped at the canary [17:12:00] since the other ones have puppet disabled [17:12:09] I am going to re-enable them and run scap deploy [17:12:14] 06Operations, 10procurement: eqiad: Purchase new air filters for cr1/2-eqiad - https://phabricator.wikimedia.org/T142109#2522980 (10Cmjohnson) [17:12:19] ack [17:12:31] tgr: I'm sorry I didn't get to https://gerrit.wikimedia.org/r/#/c/302650 :( [17:13:18] godog: no worries, it's not urgent [17:13:19] (03CR) 10Mobrovac: [C: 031] "LGTM, we could schedule it for the next PuppetSWAT" [puppet] - 10https://gerrit.wikimedia.org/r/302309 (https://phabricator.wikimedia.org/T139674) (owner: 10Ppchelko) [17:13:43] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch SSL on relforge hosts monitoring alerts - https://phabricator.wikimedia.org/T141234#2523002 (10debt) a:03Gehel [17:14:37] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch SSL on relforge generating monitoring alerts - https://phabricator.wikimedia.org/T141234#2491070 (10debt) [17:14:39] 06Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2523010 (10Cmjohnson) 05Open>03Resolved [17:14:41] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2523011 (10Cmjohnson) [17:14:51] !log T140825,T140869: Restarting Cassandra, restbase1007-b.eqiad.wmnet [17:14:53] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [17:14:53] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [17:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:59] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint:
Elasticsearch SSL on relforge is generating monitoring alerts - https://phabricator.wikimedia.org/T141234#2523014 (10ksmith) [17:15:04] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:16:38] ok, things are semi-stable [17:17:07] Dereckson, sorry, there were ongoing issues, when is the new wiki being created? [17:17:34] I see it now [17:17:37] !log T140825,T140869: Restarting Cassandra, restbase1007-c.eqiad.wmnet [17:17:39] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [17:17:39] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [17:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:02] I need to do things after the actual database gets created, but I think it can wait until my friday morning [17:18:23] I just need to be notified so it doesn't get forgotten [17:18:30] so thank you for that [17:18:34] !log gerrit: Restarting really quick, trying alternative mysql library. [17:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:20] Well *that* didn't work [17:21:39] o.O [17:23:31] "there's your problem" type of thing?
[17:24:04] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures [17:24:38] does anyone remember a change to mediawiki saying something in line with "making rename faster" - I swear I saw something like that recently [17:25:05] jynus: they made it slower instead if you refer to user renaming [17:25:11] oh [17:25:14] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures [17:25:18] so it was the other way around [17:25:23] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 7 failures [17:25:45] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:25:54] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:26:00] jynus: now the jobs run one per wiki, so they're slower, but less prone to fail (the last thing still happens from time to time though) [17:26:16] maybe a logic change introduced some regression, I will have a look at the history [17:26:18] when one finishes, it starts running the next, etc. [17:26:45] I think it was tgr who patched that, or legoktm; or both. I don't quite remember. 
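The rename behavior tgr describes above, one job per wiki with each job starting the next when it finishes, slower overall but with a failure only affecting a single wiki, can be sketched as a toy model (this is not MediaWiki's actual job-queue code):

```python
from collections import deque

def run_renames_serially(wikis, rename_one):
    """Process one wiki at a time; when a job finishes, the next is queued.

    Mirrors the trade-off described above: slower than one global job,
    but a failure is isolated to a single wiki and the chain continues.
    """
    queue = deque(wikis)
    done, failed = [], []
    while queue:
        wiki = queue.popleft()
        try:
            rename_one(wiki)
            done.append(wiki)
        except Exception:
            failed.append(wiki)  # the rest of the chain still runs
    return done, failed

def fake_rename(wiki):
    # Hypothetical stand-in for the per-wiki rename job.
    if wiki == "plwiki":
        raise RuntimeError("simulated failure on one wiki")

done, failed = run_renames_serially(["enwiki", "plwiki", "dewiki"], fake_rename)
print(done, failed)  # ['enwiki', 'dewiki'] ['plwiki']
```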
[17:26:54] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:27:14] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:27:23] there was no issue in the last 8 months that I remember, then 2 in the same week sounds strange [17:28:34] jynus: I think it's because some rather large users were renamed, rather than any code changes [17:28:34] the underlying db queries have remained mostly the same [17:29:03] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:29:18] I think you agree with me that we will solve that eventually and forever :-) [17:29:31] thcipriani: deploying on all nodes, looks good so far, but the weird error message is still there [17:29:51] elukey: yup, watching. [17:30:05] I will try to add more resources to that service for now [17:30:49] just done, all good! [17:30:52] I've seen that error recently, but couldn't find a way to reproduce. Hopefully this will give me more clues as to where it's coming from. [17:31:13] elukey: thank you for all your work on moving this over! [17:31:27] thcipriani: thank you for the support!
I owe you a lot of beers [17:31:32] :D [17:31:35] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:32:12] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:32:24] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:33:03] I am looking at the logs, and I am not finding the typical errors, so maybe failover worked well, and users didn't notice at all [17:33:21] if that gets confirmed, I could probably lower the priority of the bug [17:34:00] thcipriani: going afk in a bit, but everything looks good on my side. let's chat tomorrow for the python stack trace :) [17:34:14] elukey: cool, sounds good :) [17:34:22] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:36:42] !log added the analytics-deploy key to the Keyholder for the Analytics Refinery scap3 migration (also updated https://wikitech.wikimedia.org/wiki/Keyholder) [17:36:43] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - free space: / 494 MB (1% inode=81%) [17:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:49] what!!! [17:38:01] checking analytics1027 [17:38:44] ottomata: --^ [17:38:52] 99% space consumption on 1027 [17:39:05] ! [17:39:57] yeah :/ [17:40:02] 1.1G camus logs, but that's not that much [17:40:02] I just deployed with scap [17:40:43] 3.3G in tmp [17:40:54] hm, that's just those old refinery dirs [17:41:01] removing those, but i don't think that is the problem [17:42:16] ottomata: I am checking the refinery's dir.. maybe git-fat + multiple revisions of the repo are not good? [17:42:17] blerg. Since scap swaps out symlinks between code-repos as a rollback mechanism it's not particularly conservative with space.
[17:42:25] ahhh yes [17:42:29] my suspicion [17:42:36] ayeyyy [17:42:40] and we have only one version deployed [17:42:43] refinery is 4G! [17:42:44] RECOVERY - Disk space on analytics1027 is OK: DISK OK [17:43:14] so each deployment will be a +4? [17:43:18] or a +2 maybe [17:43:41] maybe +2, for artifacts [17:43:41] not sure [17:43:45] sounds like the git-fat file is 2G? [17:44:17] 1.9G artifacts/org/wikimedia/analytics/refinery [17:44:20] there are lots of files [17:44:25] so initially it'll be 4G, every additional will be a +2G thereafter. scap does clean up after itself after 5 deploys (can't remember if we made that configurable or not) [17:44:27] elukey: i think we can reduce this by removing the old versions [17:44:36] or at least some [17:44:53] we keep lots of old versions of refinery jars around [17:44:58] so that when we sync to hdfs they still exist there [17:45:09] and running jobs that reference those versions still have the jars. [17:45:10] hm. [17:45:19] but, when we deploy to hdfs, we also deploy in a versioned directory [17:45:27] and the jobs link directly to that versioned directory. 
so, i think it would be safe to remove old versions of jars in refinery artifacts [17:45:40] hmm [17:46:08] maybe it could break, if a job specifies refinery_version: whatever.old.version, and then we deploy to hdfs without whatever.old.version in artifacts [17:46:24] and then we restart that job out of the latest deployed refinery in hdfs [17:46:53] elukey: i'm going to go ahead and make a patch that removes some old jars [17:46:54] but not all [17:46:56] just really old ones [17:47:14] RECOVERY - Elasticsearch HTTPS on relforge1002 is OK: SSL OK - Certificate relforge1002.eqiad.wmnet valid until 2021-08-03 17:43:52 +0000 (expires in 1824 days) [17:47:22] RECOVERY - Elasticsearch HTTPS on relforge1001 is OK: SSL OK - Certificate relforge1001.eqiad.wmnet valid until 2021-08-03 17:42:59 +0000 (expires in 1824 days) [17:47:40] ottomata: I am a bit ignorant on the refinery's internals :( [17:48:19] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch SSL on relforge is generating monitoring alerts - https://phabricator.wikimedia.org/T141234#2523193 (10Gehel) 05Open>03Resolved There was an issue with SSL certificates which I had not regenerated to include alt_names.... [17:50:33] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:50:53] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:53:41] @next [17:54:46] I think curl Deployments will be faster than for me to remember the commands [18:00:01] jouncebot: next [18:00:01] In 0 hour(s) and 59 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T1900) [18:00:44] good, I will slip in https://gerrit.wikimedia.org/r/302952 [18:01:44] am I missing a bot?
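The disk arithmetic from the analytics1027 discussion above, roughly 4G for the first refinery deploy (repo plus git-fat artifacts), about +2G per subsequent deploy, with scap pruning old revisions after about 5 cached deploys (thcipriani wasn't sure whether that count is configurable), can be modeled as a rough sketch; the 4G/2G/5 figures come straight from the conversation and are approximate:

```python
def deploy_dir_size(num_deploys, first_gb=4, per_deploy_gb=2, cache_revs=5):
    """Rough model of deploy-target disk growth under scap3's rev caching.

    Assumes the first deploy costs first_gb and each further cached
    revision adds per_deploy_gb, with at most cache_revs revisions kept
    before old ones are pruned.
    """
    if num_deploys <= 0:
        return 0
    cached = min(num_deploys, cache_revs)
    return first_gb + (cached - 1) * per_deploy_gb

print(deploy_dir_size(1))   # 4
print(deploy_dir_size(3))   # 8
print(deploy_dir_size(10))  # 12 -- growth stops once pruning kicks in
```

The takeaway matches the chat: usage climbs by ~2G per deploy until the cache cap, then plateaus, which is why a nearly full / on a 4G repo was an immediate problem.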
[18:02:02] hey grrrit-wm [18:02:06] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2523225 (10HJiang-WMF) No, I meant that even with my ssh config specified that I could log in just at my local machine with ssh stat1... [18:03:31] I'm going to restart it [18:03:46] legoktm: I got it I think [18:03:51] oh, thanks :) [18:04:20] 18:03:52 sync-file failed: ('%d test canaries had check failures', 11) [18:04:33] legoktm: hmmm... I restarted the pos but the bot didn't part and rejoin...? [18:04:35] *pod [18:04:53] this integration with logstash, is it reliable? [18:04:56] might take a minute? [18:05:34] my scap failed spectacularly [18:06:33] legoktm: I restarted the wrong pod :) [18:06:48] bd808, should I just retry? is there a new sync method I do not know about? [18:07:10] I just want this very small config change done [18:07:12] jynus: hmmm... I think thcipriani rolled out a new version of scap recently [18:07:36] (03PS3) 10Ottomata: Remove pagecounts-[raw|all-sites] related code [puppet] - 10https://gerrit.wikimedia.org/r/302932 (https://phabricator.wikimedia.org/T130656) (owner: 10Joal) [18:07:55] jynus: use: scap sync-file --force to override that check [18:08:19] ok, I follow your directions [18:08:20] but it's weird that it's failing. It means there was a 10x increase in errors on the canary machines after you deployed that file. [18:08:26] no [18:08:29] it is not failing [18:08:31] the code is [18:08:42] the check gives an error [18:08:45] oh, sorry, misunderstood.
not that the check fails to pass [18:09:02] you probably can see it on the log [18:09:09] gaierror(-2, 'Name or service not known')': /logstash-*/_search [18:09:12] lots of that [18:09:22] Generic connection error: HTTPConnectionPool(host='logstash1001.eqiad.wmnet;9200', port=80) [18:09:32] to be fair [18:09:37] probably the connection is [18:09:52] but I immediately checked kibana just in case [18:09:59] so some firewall issue ? [18:10:12] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2523253 (10HJiang-WMF) Clarification: stat1003 log in is so slow that I'm not certain if it would actually be logged in [18:10:18] uhhh, ;9200 port=80 is probably the problem [18:10:27] oh [18:10:33] good catch [18:10:49] you know, ops always think it is an infra problem :-) [18:11:09] I will run it with force to clear the queue [18:11:19] then you can debug at your own pace [18:11:41] jynus: ack. I'll have a patch momentarily. Sorry about that. [18:11:48] sorry for breaking things! [18:11:51] out of time [18:12:01] :) [18:12:09] that looks like a cool feature, though [18:12:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1074 as backup rc node in case db1036 lags (duration: 00m 25s) [18:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:30] I will keep around for some time, checking a 1% traffic change requires time [18:15:11] thcipriani: Patch incoming to scap to fix logstash. [18:15:47] BTW, are any of you power users of kibana? [18:17:26] is there a way of sending something to SAL programmatically, outside of IRC? [18:17:44] (03PS1) 10Thcipriani: Scap: Override default logstash host [puppet] - 10https://gerrit.wikimedia.org/r/302957 [18:18:11] ^ that should fix scap in production [18:18:16] urandom: /usr/local/bin/dologmsg [18:19:03] greg-g: where is that?
[18:19:15] /usr/local/bin on what host(s)? [18:19:36] urandom: it's on tin at least [18:20:37] thcipriani, greg-g: cool, thanks [18:20:49] and terbium [18:20:58] tin/mira, and terbium/wasat, at least [18:21:08] deploy hosts and work hosts in both dcs [18:21:53] yeah, it's a crazy simple shell script tho [18:22:02] 3-liner, wraps netcat [18:22:09] Could also be a firewall issue for logstash, not being able to talk on alternative port. [18:22:34] urandom: yeah, the only gotcha is it needs ferm rules to work on new hosts [18:23:08] greg-g: makes sense [18:23:08] (03CR) 10Chad: [C: 031] Scap: Override default logstash host [puppet] - 10https://gerrit.wikimedia.org/r/302957 (owner: 10Thcipriani) [18:23:40] ostriches: logstash checker works from tin, currently. Just my fat-fingered ";" seemingly breaking stuff. logstash_checker.py --logstash-host logstash1001.eqiad.wmnet:9200 --host mw1017.eqiad.wmnet [18:23:43] urandom: see eg https://phabricator.wikimedia.org/T141619 [18:24:25] thcipriani: Ok, so just the port fix and not the firewall too? [18:24:49] ostriches: that's correct, firewall seems good [18:24:55] k [18:26:16] Also did https://phabricator.wikimedia.org/D303 [18:26:49] Although maybe best for default would be localhost:9200 [18:27:03] accepted, thanks :) [18:30:24] could I get an opsen to +2 https://gerrit.wikimedia.org/r/#/c/302957/ to unbreak my scap screwup? [18:30:41] (03CR) 10Alexandros Kosiaris: [C: 032] Scap: Override default logstash host [puppet] - 10https://gerrit.wikimedia.org/r/302957 (owner: 10Thcipriani) [18:31:02] alex beat me to it [18:31:15] akosiaris: jynus thank you! [18:31:45] no, thanks to all of you for fixing it! 
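Back on the scap/logstash breakage fixed above: the root cause was a stray ';' in 'logstash1001.eqiad.wmnet;9200', which left the port silently defaulting to 80. A strict host:port validator along these lines would have caught the typo up front (a sketch of the idea, not scap's actual logstash_checker parsing code):

```python
def split_hostport(value, default_port=9200):
    """Split 'host:port' strictly, rejecting junk like 'host;port'."""
    host, sep, port = value.rpartition(":")
    if not sep:  # no colon at all: fall back to the default port
        host, port = value, str(default_port)
    if (not host or not port.isdigit()
            or not all(c.isalnum() or c in ".-" for c in host)):
        raise ValueError("malformed host:port: %r" % value)
    return host, int(port)

print(split_hostport("logstash1001.eqiad.wmnet:9200"))
# ('logstash1001.eqiad.wmnet', 9200)

try:
    split_hostport("logstash1001.eqiad.wmnet;9200")  # the actual typo
except ValueError as err:
    print(err)
```

Failing loudly at parse time beats a "Generic connection error" against the wrong port during a canary check.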
[18:32:15] I only the part of complaining :-P [18:32:18] *did [18:32:26] testers are needed :P [18:33:02] to be fair, it happens many times because I am usually one of the earliest people in the morning to test deployments [18:33:15] although normally is an issue with the application server hosts [18:33:19] :-) [18:33:34] subbu: wtp1008-wtp1012 are now jessie. should hopefully be done by tomorrow [18:33:39] with how many, 300? there is always one with issues [18:35:35] jynus: :/ yep, machines are ficle [18:35:36] fickle [18:49:03] !log thcipriani@tin Synchronized README: test logstash host (duration: 00m 51s) [18:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:19] ^ logstash patch test. Works. [19:00:05] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T1900). Please do the needful. [19:00:58] (03PS2) 10Chad: gerrit (2.12.2-wmf.2) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/302498 [19:01:10] * thcipriani sets train wheels in motion [19:03:04] (03CR) 10Paladox: [C: 031] gerrit (2.12.2-wmf.2) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/302498 (owner: 10Chad) [19:03:57] (03PS1) 10Thcipriani: all wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302968 [19:04:29] (03PS6) 10Paladox: Add gbp.conf file for debian [debs/gerrit] - 10https://gerrit.wikimedia.org/r/301841 [19:05:12] (03PS8) 10Paladox: Testing [debs/gerrit] - 10https://gerrit.wikimedia.org/r/302371 [19:05:44] (03CR) 10Thcipriani: [C: 032] all wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302968 (owner: 10Thcipriani) [19:06:16] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302968 (owner: 10Thcipriani) [19:06:39] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.13 [19:06:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:34] (03PS1) 10Alexandros Kosiaris: graphite: Fix @hostname in local-relay.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/302972 [19:23:09] (03CR) 10Alexandros Kosiaris: [C: 032] graphite: Fix @hostname in local-relay.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/302972 (owner: 10Alexandros Kosiaris) [19:23:12] 06Operations, 06Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2523589 (10HJiang-WMF) [19:23:16] (03PS2) 10Alexandros Kosiaris: graphite: Fix @hostname in local-relay.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/302972 [19:23:19] (03CR) 10Alexandros Kosiaris: [V: 032] graphite: Fix @hostname in local-relay.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/302972 (owner: 10Alexandros Kosiaris) [19:23:53] (03PS1) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [19:24:07] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2523606 (10HJiang-WMF) 05Open>03Resolved The "password" issue and connection to bastion hosts are resolved. The time out issue of... [19:25:11] why 'cannot merge' when my local repo is "git pull origin master --> already up to date" ?? [19:25:13] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:27:36] (03PS2) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [19:29:30] * apergos misreads "Human rights configuration changes..." 
must be time for brain food again [19:29:32] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2523622 (10matmarex) p:05Normal>03Triage [19:29:34] stupid brain [19:31:19] (03PS3) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [19:32:12] apergos: rotfl :) [19:34:04] (03PS1) 10Alexandros Kosiaris: cassandra: trickle_fsync should be boolean [puppet] - 10https://gerrit.wikimedia.org/r/302978 [19:37:28] !log T140825,T140869: Restarting Cassandra, restbase1010-a.eqiad.wmnet [19:37:30] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [19:37:30] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:35] (03PS13) 10Chad: Gerrit: Make SSL optional for proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [19:37:37] (03PS1) 10Chad: Gerrit: Simplify rewrites to avoid mentioning the host unless needed [puppet] - 10https://gerrit.wikimedia.org/r/302980 [19:39:20] thcipriani: is the train done? 
:) [19:39:22] (03PS2) 10Alexandros Kosiaris: rhodium: Ramp up to the same load as strontium [puppet] - 10https://gerrit.wikimedia.org/r/302923 [19:39:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] rhodium: Ramp up to the same load as strontium [puppet] - 10https://gerrit.wikimedia.org/r/302923 (owner: 10Alexandros Kosiaris) [19:39:29] Reedy: yep [19:39:38] !log deploying eventlogging-service-eventbus to codfw hosts, depooling and pooling [19:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:51] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Make SSL optional for proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [19:40:07] !log T140825,T140869: Restarting Cassandra, restbase1010-b.eqiad.wmnet [19:40:08] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [19:40:08] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:56] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Simplify rewrites to avoid mentioning the host unless needed [puppet] - 10https://gerrit.wikimedia.org/r/302980 (owner: 10Chad) [19:40:58] !log otto@palladium conftool action : set/pooled=no; selector: kafka2001.codfw.wmnet [19:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:22] !log T140825,T140869: Restarting Cassandra, restbase1010-c.eqiad.wmnet [19:42:24] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [19:42:24] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:31] hmm. 
Can I be bothered trying to setup proxycommand on putty [19:43:56] !log otto@palladium conftool action : set/pooled=yes; selector: kafka2001.codfw.wmnet [19:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:07] (03PS2) 10Paladox: Gerrit: Simplify rewrites to avoid mentioning the host unless needed [puppet] - 10https://gerrit.wikimedia.org/r/302980 (owner: 10Chad) [19:44:37] (03CR) 10Paladox: "We need to have the host name otherwise I think it fails but not sure." [puppet] - 10https://gerrit.wikimedia.org/r/302980 (owner: 10Chad) [19:44:56] Reedy you can do it through git [19:45:14] Reedy or just download windows 10 and install bash lol [19:45:23] paladox: Please don't rebase my patches. [19:45:29] Oh sorry [19:45:44] !log otto@palladium conftool action : set/pooled=no; selector: kafka2002.codfw.wmnet [19:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:09] !log restart puppetmaster on palladium to activate loadfactor change. puppet related icinga spam will ensure [19:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:26] !log restart puppetmaster on palladium to activate loadfactor change. 
puppet related icinga spam will ensue* [19:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:03] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures [19:47:24] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 1 failures [19:47:50] !log otto@palladium conftool action : set/pooled=yes; selector: kafka2002.codfw.wmnet [19:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:33] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: puppet fail [19:49:43] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: puppet fail [19:49:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "looks good in https://puppet-compiler.wmflabs.org/3612/restbase1007.eqiad.wmnet/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/302978 (owner: 10Alexandros Kosiaris) [19:49:55] (03PS2) 10Alexandros Kosiaris: cassandra: trickle_fsync should be boolean [puppet] - 10https://gerrit.wikimedia.org/r/302978 [19:50:01] (03CR) 10Alexandros Kosiaris: [V: 032] cassandra: trickle_fsync should be boolean [puppet] - 10https://gerrit.wikimedia.org/r/302978 (owner: 10Alexandros Kosiaris) [19:50:03] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: puppet fail [19:50:22] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: puppet fail [19:50:32] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 8 failures [19:50:33] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 15 failures [19:50:33] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: puppet fail [19:50:33] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: puppet fail [19:50:42] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [19:50:43] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 4 failures [19:50:44] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: 
puppet fail [19:50:44] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 11 failures [19:50:54] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 13 failures [19:51:03] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: puppet fail [19:51:03] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: puppet fail [19:51:03] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: puppet fail [19:51:12] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 8 failures [19:51:13] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 7 failures [19:51:22] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Puppet has 1 failures [19:51:23] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Puppet has 3 failures [19:51:23] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Puppet has 15 failures [19:51:24] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: puppet fail [19:51:34] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 11 failures [19:51:54] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 15 failures [19:52:02] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 6 failures [19:52:02] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 12 failures [19:52:12] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Puppet has 3 failures [19:52:32] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Puppet has 9 failures [19:52:33] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Puppet has 6 failures [19:53:03] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 1 failures [19:53:24] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:53:33] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: Puppet has 10 failures [19:55:38] yikes [19:56:00] (03PS2) 10Reedy: 11 more to extension.json for 
wmf/1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302942 (https://phabricator.wikimedia.org/T139800) [19:57:01] akosiaris: Did you break puppet? :P [19:57:45] (03CR) 10Reedy: [C: 032] 11 more to extension.json for wmf/1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302942 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [19:58:14] (03Merged) 10jenkins-bot: 11 more to extension.json for wmf/1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302942 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [19:58:26] !log testing some eventbus kafka failure scenarios in codfw with test.event. (short icinga downtime has been scheduled) [19:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:23] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:59:33] !log reedy@tin Synchronized wmf-config/extension-list: More to extension.json in extension-list (duration: 00m 52s) [19:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:46] $techDebt--; [20:01:15] Reedy: it's always broken, needs no help from me :P [20:03:26] (03CR) 10Legoktm: [C: 031] Remove temporary wgCentralAuthEnableUserMerge override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301746 (owner: 10MaxSem) [20:03:38] !log T140825,T140869: Performing Cassandra instance rolling restart of restbase1011.eqiad.wmnet [20:03:40] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [20:03:40] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [20:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:43] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: puppet fail [20:04:54] PROBLEM - puppet last run on notebook1002 is 
CRITICAL: CRITICAL: puppet fail [20:04:54] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: puppet fail [20:05:24] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 2 failures [20:05:27] (03PS1) 10MarcoAurelio: Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 [20:05:43] 06Operations, 06Release-Engineering-Team, 10Traffic, 07HTTPS: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2523742 (10demon) [20:06:04] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:04] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:23] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:23] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:33] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:53] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:55] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:03] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:03] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:03] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:13] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:22] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:23] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:23] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:32] so many scary upper-cased adjectives [20:07:33] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:33] PROBLEM - puppet 
last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:34] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:43] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:43] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:43] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:44] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:52] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:53] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:09] urandom: icinga went to LAW SCHOOL ;-) [20:08:12] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:13] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:14] (03PS2) 10MarcoAurelio: Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 [20:08:15] heh [20:08:23] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:42] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:42] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:02] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:03] (03PS3) 10MarcoAurelio: Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 [20:09:13] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:22] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:23] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [20:09:55] PROBLEM!! 
[20:10:21] (03CR) 10Krinkle: [C: 031] mwgrep: fails gracefully when an invalid regex is provided (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) (owner: 10DCausse) [20:10:44] SPF|Cloud: don't be so CRITICAL [20:10:55] !log T140825,T140869: Cassandra instance restarts complete: restbase1011.eqiad.wmnet [20:10:56] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [20:10:57] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [20:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:08] Well here I need a quick RECOVERY [20:12:24] Heh, JOKING should be an alias for RECOVERY [20:12:27] eyeroll [20:12:33] sounds like you got a PROBLEM to me [20:13:01] if I knew this was going to give people such a NICE time I would be doing it more OFTEN [20:13:20] !log T140825,T140869: Performing rolling restart of codfw Cassandra instances [20:13:21] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [20:13:21] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [20:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:40] decommissioning the puppetmasters should also be fine ;) [20:13:53] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:06] SPF|Cloud: close enough.
[20:14:13] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:53] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:16:03] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:16:03] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:16:03] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:16:23] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:16:33] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:43] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:43] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:16:44] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:16:52] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:17:03] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:13] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:13] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:23] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:17:23] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [20:17:34] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:43] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [20:17:43] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:13] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:18:23] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:19:13] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:19:32] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:41] (03Draft2) 10Paladox: Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 [20:20:12] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:29] (03Draft1) 10Paladox: Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 [20:20:44] (03CR) 10jenkins-bot: [V: 04-1] Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 (owner: 10Paladox) [20:21:13] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:13] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/302982 (owner: 10Paladox) [20:21:19] (03PS3) 10Paladox: Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 [20:21:23] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:33] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:52] RECOVERY - puppet last run on wtp2016 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:53] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:22] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:23] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:22:47] (03CR) 10jenkins-bot: [V: 04-1] Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 (owner: 10Paladox) [20:23:13] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:43] (03PS1) 10MarcoAurelio: Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 [20:23:52] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:23:52] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:24:18] (03CR) 10jenkins-bot: [V: 04-1] Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 (owner: 10MarcoAurelio) [20:24:20] (03CR) 10Krinkle: [C: 031] Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 (owner: 10MarcoAurelio) [20:24:43] 06Operations, 06Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2523589 (10Neil_P._Quinn_WMF) @HJiang-WMF, thank you for filing this! Could you paste the output you get when you run `ssh -v stat1003`? 
[20:24:44] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:41] not sure why it fails [20:25:54] @ gerrit 302990 [20:25:56] (03PS4) 10Paladox: Test for now [puppet] - 10https://gerrit.wikimedia.org/r/302982 [20:26:32] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:42] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:02] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:22] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:38] 06Operations, 10Traffic, 07HTTPS, 07Security-General: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298#1105632 (10GWicke) Related: https://tom.vg/papers/heist_blackhat2016.pdf [20:27:53] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:03] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:09] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 (owner: 10MarcoAurelio) [20:28:42] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:54] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:03] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:03] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:02] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:09] mafk, 20:28:21 PHP Parse error: syntax error, unexpected ''rollback'' (T_CONSTANT_ENCAPSED_STRING), expecting ']' in wmf-config/InitialiseSettings.php on line 7965 [20:30:30] Krenair: but I see no ''rollback'' but 'rollback' [20:30:32] mafk, you missed the comma on the previous line [20:30:38] oh [20:31:01] ah, that's it [20:31:01] it 
was not expecting 'rollback', and it wraps that part in its own apostrophes [20:31:13] it was not expecting it because you forgot the comma before [20:31:30] fixing :D [20:31:39] you can find this error by clicking the operations-mw-config-php55lint link in jenkins-bot's comment [20:32:03] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:17] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:23] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:58] (03PS2) 10MarcoAurelio: Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 [20:33:55] Yep, but I didn't notice that it was the line before. Thank you. [20:34:25] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: puppet fail [20:35:20] (03PS1) 10Jdlrobson: Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) [20:35:47] (03CR) 10Danny B.: "> This does not tell me why we are doing it, nor does it tell me why" [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) (owner: 10Paladox) [20:36:17] what will sit forever is https://gerrit.wikimedia.org/r/#/c/272499/ [20:36:28] until that wikibase test is amended [20:39:19] (03PS1) 10Jdlrobson: Promote new language switcher to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303003 (https://phabricator.wikimedia.org/T129505) [20:40:13] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 379 bytes in 0.107 second response time [20:40:17] (03CR) 10Dereckson: [C: 031] Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 (owner: 10MarcoAurelio) [20:40:25] !log restart apache on palladium.
Managed to get it into a deadlock. OFC puppet spam will flood the channel [20:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:32] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 353 bytes in 0.124 second response time [20:40:34] Reedy: now I broke puppet... [20:40:41] badly this time around [20:40:42] xD [20:40:51] akosiaris: \o/ [20:42:09] doo dee doo dee doo [20:42:14] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:42:30] (03PS4) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [20:42:36] (03PS5) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [20:42:59] akosiaris: so. much. red. [20:43:33] (03CR) 10Dereckson: "Could we find a more descriptive name for this option in the Mobile Frontend code?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [20:43:42] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on elastic2023 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: puppet fail [20:43:42] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: puppet fail [20:43:43] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: puppet fail [20:43:43] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: puppet fail [20:43:44] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: puppet fail [20:43:44] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet has 79 failures [20:43:45] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [20:43:45] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail [20:43:46] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: puppet fail [20:44:38] (03CR) 10Dereckson: [C: 031] Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 (owner: 10MarcoAurelio) [20:46:44] 06Operations, 06Release-Engineering-Team, 10Traffic, 07HTTPS: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2524001 (10RobH) I used to habitually revoke old certificates, but I was advised against doing so indiscriminately by @bblack. My understanding is unless the priva... [20:49:08] ostriches: around?
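The missing-comma bug mafk hit earlier (PHP reporting unexpected ''rollback'' in InitialiseSettings.php) is a classic: the parser blames the token after the real mistake, so the error points one line past it. A minimal, hypothetical config fragment illustrating the pattern — the array keys here are invented for illustration and are not the actual wmf-config contents:

```php
<?php
// Broken — PHP reports "unexpected ''rollback''" on the second line,
// but the actual mistake is the missing comma on the line before:
//   'patrollers'  => [ 'patrol' => true ]    // <-- comma missing here
//   'rollbackers' => [ 'rollback' => true ],

// Fixed:
$wmgExampleGroupSettings = [
    'patrollers'  => [ 'patrol' => true ],   // comma restored
    'rollbackers' => [ 'rollback' => true ],
];
```

Running `php -l wmf-config/InitialiseSettings.php` surfaces this before a sync; that lint check is what the operations-mw-config-php55lint Jenkins link mentioned above reports.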
[20:49:10] (03PS6) 10Paladox: Strip out branch HEAD in git.wikimedia.org tree link [puppet] - 10https://gerrit.wikimedia.org/r/302747 (https://phabricator.wikimedia.org/T141965) [20:51:03] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2524009 (10Nuria) Approved. I assume the access will be revoked when project is completed cc @Dartar? How do we follow up... [20:58:42] (03CR) 10Bmansurov: [C: 031] Promote new language switcher to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303003 (https://phabricator.wikimedia.org/T129505) (owner: 10Jdlrobson) [21:00:04] Dereckson: Dear anthropoid, the time has come. Please deploy Add a wiki: tcy.wikipedia (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T2100). [21:00:27] oh, are we creating tcy? [21:00:37] Yep [21:01:00] There is a 302 redirect from tcy.wikipedia.org to https://incubator.wikimedia.org/wiki/Wp/tcy?goto=mainpage. [21:01:14] delete delete [21:01:19] Is that handled by the default parking page? I don't see anything in puppet repo.
[21:01:43] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:01:43] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:01:44] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:44] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:01:45] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:45] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:46] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:01:46] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:01:48] Dereckson, might need a cache purge [21:01:52] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:01:52] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:01:52] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:52] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:52] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:01:53] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:01:53] RECOVERY - puppet last run on analytics1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:01:53] RECOVERY - puppet last 
run on mw2125 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:01:54] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:01:54] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:02:02] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:02] RECOVERY - puppet last run on auth2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:02:02] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:02] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:02] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:02:03] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [21:02:03] RECOVERY - puppet last run on wmf4724 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:03] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [21:02:04] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:02:04] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:02:05] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:12] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:12] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [21:02:13] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:13] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [21:02:13] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:02:22] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:22] RECOVERY - puppet last run on mc1013 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:02:23] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:02:23] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:23] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:32] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:02:32] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:32] RECOVERY - puppet last run on mw2210 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:02:33] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:33] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:33] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:02:33] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:33] RECOVERY - puppet 
last run on cp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:34] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:34] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:35] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:42] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:02:43] hmm. or not [21:02:43] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:43] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:43] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:02:44] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:02:52] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:02:52] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:52] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:52] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:53] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:54] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:02:57] okay this is rather disruptive, can someone +q that? 
[21:03:02] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:03:03] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:03:04] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:12] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:03:12] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:03:13] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:13] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:13] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:13] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:03:13] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:22] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:22] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:03:22] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:22] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:22] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:03:23] RECOVERY - puppet last run on lvs1010 is OK: OK: Puppet is currently 
enabled, last run 25 seconds ago with 0 failures [21:03:23] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:24] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:03:24] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:03:24] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:32] robh: +q icinga-wm please [21:03:32] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:03:33] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:33] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:03:56] thanks [21:04:15] oh, I always end up doing the host name (that's what "quiet #wikimedia-operations $nick" does) [21:04:17] wasn't me. [21:04:21] 'twas me [21:04:39] make sure to turn it back to normal in 15m or so =] [21:04:52] I'll be in a meeting, ping me if I forget :) [21:05:00] or I'll just do it [21:05:01] heh [21:05:02] /mode #wikimedia-operations +q icinga-wm!*@* ?? [21:05:05] Dereckson, okay so [21:05:12] this is going through mediawiki-config's missing.php [21:05:19] Dereckson maybe we follow how they did it with jam [21:05:25] /msg ChanServ unquiet #wikimedia-operations icinga-wm [21:05:28] ^^ [21:05:48] Which happens if the DB name is not in wgLocalDatabases [21:05:54] missing has indeed an incubator redirect [21:05:54] https://phabricator.wikimedia.org/T134017 [21:06:36] Dereckson, did you not add it to all.dblist?
[21:06:40] that's what wgLocalDatabases is built on [21:07:06] I Krenair that should be fixed by https://gerrit.wikimedia.org/r/#/c/300182/ [21:07:41] haha yeah [21:07:45] not gonna work without that one Dereckson [21:08:20] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05MW-1.28-release-notes, and 3 others: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2524123 (10Dereckson) @MF-Warburg Could you ask them what collation they want for the categories? [21:08:21] Remove the I [21:09:02] Krenair: I'm still reviewing if all is fine, haven't merged the settings patch yet [21:09:22] okay [21:09:38] (03CR) 10Paladox: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [21:09:48] Dereckson Krenair i need to check if it needs rebasing [21:09:50] (03CR) 10jenkins-bot: [V: 04-1] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [21:09:52] since mw wasn't updated [21:09:57] Yep [21:09:59] i was right [21:10:09] I'm rebasing now [21:10:45] paladox: we're at 1.28.0-wmf.13 [21:10:52] Ok [21:10:57] I need to update to that [21:10:58] thanks [21:11:27] gerrit edit interface <3 [21:12:11] yeah that's why I suggested not doing the wikiversions change until the day [21:12:16] Ok that is strange [21:12:16] but then of course I forgot to do it on the day [21:12:22] it is saying it is all up to date [21:12:24] when rebasing [21:12:45] so...
¯\_(ツ)_/¯ [21:12:47] Oh that's why [21:12:54] $ git pull [21:12:54] git reset --hard origin/master [21:12:54] ssh: connect to host gerrit.wikimedia.org port 294 [21:13:02] 29418 [21:13:23] Sorry it didn't paste all of it [21:13:24] $ git pull [21:13:24] git reset --hard origin/master [21:13:24] ssh: connect to host gerrit.wikimedia.org port 29418: Connection timed out [21:13:24] fatal: Could not read from remote repository. [21:13:37] I am going to delete the repo and redownload it [21:13:44] strange though, it was working before [21:14:07] * mafk tests [21:14:15] mafk you can't use the inline edit to rebase patches [21:14:19] !log T140825,T140869: Rolling restart of codfw Cassandra instances complete [21:14:20] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [21:14:20] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:36] $ git reset --hard origin/master [21:14:38] HEAD is now at 91ecc45 11 more to extension.json for wmf/1.28.0-wmf.12 [21:14:52] paladox: yep, I know, only simple rebases and through the button [21:15:02] Yep [21:15:12] Seems it is stuck for me git cloning it [21:15:17] working now [21:15:24] :D [21:17:26] (03PS1) 10Reedy: Load WikimediaMessages via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303014 (https://phabricator.wikimedia.org/T140852) [21:19:18] All rebased now [21:19:18] (03PS12) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [21:19:24] Thanks.
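The wedged-clone symptoms above (a stale `git pull` against an unreachable remote, followed by a confusing "up to date" state) can usually be cleared without deleting and re-downloading the repository. A minimal sketch of the fetch-then-reset recovery; the remote here is simulated with a local repository so it runs anywhere, whereas in the log the real remote is gerrit.wikimedia.org on port 29418:

```shell
# Sketch: discard local state instead of re-cloning. A local "origin"
# stands in for the gerrit remote (an assumption for the demo).
set -e
tmp=$(mktemp -d)
git init -q -b master "$tmp/origin"
git -C "$tmp/origin" -c user.email=t@t -c user.name=t \
    commit -q --allow-empty -m init
git clone -q "$tmp/origin" "$tmp/clone"
cd "$tmp/clone"
echo junk > stray.txt            # leftover mess, as after a failed rebase
git fetch -q origin
git reset -q --hard origin/master  # drop local commits and edits
git clean -fdq                     # drop untracked files
```

After the reset plus clean, the working tree matches the remote branch exactly, which is all a re-clone would have achieved.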
[21:19:35] You're welcome [21:20:27] (03PS13) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [21:22:56] (03PS3) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:22:58] (03PS1) 10Andrew Bogott: Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 [21:24:22] (03CR) 10Dereckson: [C: 032] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [21:24:36] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:24:51] (03Merged) 10jenkins-bot: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [21:25:28] (03CR) 10jenkins-bot: [V: 04-1] Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 (owner: 10Andrew Bogott) [21:26:10] !log T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b' [21:26:11] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [21:26:12] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:30] Config live on mw1099, terbium and tin. [21:26:30] Dereckson does this https://gerrit.wikimedia.org/r/#/c/300214/ also need merging?
[21:26:47] (03PS2) 10Reedy: Load WikimediaMessages via wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303014 (https://phabricator.wikimedia.org/T140852) [21:27:10] (03PS2) 10Andrew Bogott: Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 [21:27:12] (03PS4) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:27:26] paladox: later [21:27:30] ok [21:27:50] as mobro.vac said "As soon as tcy.wikipedia.org is up and kicking" [21:28:30] (03CR) 10jenkins-bot: [V: 04-1] Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 (owner: 10Andrew Bogott) [21:28:46] Oh [21:28:50] ok [21:29:18] Does something else need to happen http://tcy.wikipedia.org/ for that to work? [21:29:19] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:29:30] without it redirecting to incubator [21:29:39] now I'm creating the db [21:29:45] Oh :) [21:29:59] (03PS3) 10Andrew Bogott: Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 [21:30:01] (03PS5) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:31:10] (03CR) 10jenkins-bot: [V: 04-1] Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 (owner: 10Andrew Bogott) [21:31:44] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:34:23] https://phabricator.wikimedia.org/P3645 [21:34:25] Catchable fatal error: Argument 1 passed to CirrusSearch\Connection::getPool() must be an instance of CirrusSearch\SearchConfig, null given, called in
/srv/mediawiki/php-1.28.0-wmf.13/extensions/CirrusSearch/includes/Maintenance/Maintenance.php on line 88 and defined in /srv/mediawiki/php-1.28.0-wmf.13/extensions/CirrusSearch/includes/Connection.php on line 90 [21:34:43] (03PS4) 10Andrew Bogott: Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 [21:34:45] (03PS6) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:34:55] Probably unloved [21:36:18] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:36:50] (03CR) 10Andrew Bogott: [C: 032] Changes to spice proxy to work behind misc-web: [puppet] - 10https://gerrit.wikimedia.org/r/303018 (owner: 10Andrew Bogott) [21:37:37] strange [21:37:59] $ sql tcywiki -h db2036 [21:37:59] Error looking up DB "tcywiki" [21:38:06] Oh [21:38:07] but the script says it created tables [21:38:12] Dereckson, yeah [21:38:17] look at what the sql command does :) [21:38:27] you'll have to `sql aawiki -h db2036` and `use tcywiki;` [21:39:00] okay so we've 76 tables [21:39:05] :) [21:39:11] how many does jamwiki have? [21:39:16] 76 [21:39:23] okay that sounds promising [21:39:25] Oh, that is a lot [21:39:33] did the whole script break at cirrus? [21:40:14] according to https://phabricator.wikimedia.org/P3645, yes [21:41:01] So that means... we've missed cirrus, wikibase site population, MassMessage cache clearing, and ugh... The damn newprojects mail [21:41:02] Is that a bug or known problem with cirrus?
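The `sql tcywiki` failure above happens because the wrapper resolves the database name against configuration (hence "look at what the sql command does") before connecting, and a freshly created wiki isn't known to it yet. The workaround in the log is to connect via an existing wiki and `USE` the new one. Below is a rough, hypothetical model of that lookup behaviour — the file layout and the `sql` function body are assumptions for illustration, not the real wrapper:

```shell
# Hypothetical model of the `sql` wrapper: refuse names it cannot find
# in all.dblist. This mirrors why `sql tcywiki` failed right after
# creation while `sql aawiki` followed by `USE tcywiki;` worked.
dblist=$(mktemp)
printf 'aawiki\njamwiki\n' > "$dblist"   # tcywiki not yet listed

sql() {
  if grep -qx "$1" "$dblist"; then
    echo "connecting to $1"
  else
    echo "Error looking up DB \"$1\"" >&2
    return 1
  fi
}

sql tcywiki || echo "fallback: sql aawiki -h db2036, then USE tcywiki;"
sql aawiki
```

The design point: the wrapper's safety check depends on generated config, so anything created after the last config sync looks nonexistent to it.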
[21:41:09] https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/addWiki.php;c247d5012c8a7c5c93a6433a6a2f2c8b11924f7a$191 [21:41:34] well, in the worst case you can delete and start again, no content will be lost [21:41:40] paladox, there is nothing more infuriating when running addWiki.php than some code screwing up and causing the whole damn thing to break half way through [21:41:51] Yep [21:41:58] Is there a bug filed? [21:42:07] bug = task [21:42:52] (03PS7) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:42:54] (03PS1) 10Andrew Bogott: typo fix re: spice package name [puppet] - 10https://gerrit.wikimedia.org/r/303072 [21:44:16] What about creating an addWiki-T140898.php script with the remaining steps, merging it into maintenance and running it? [21:44:17] T140898: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898 [21:44:18] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:44:29] I wouldn't bother [21:44:32] Krenair, paladox you have my deepest sympathies for that [21:44:38] Broken at https://phabricator.wikimedia.org/diffusion/ECIR/browse/master/includes/Maintenance/Maintenance.php;7d73c704af314dd27b1d5006fa52c7fe4a14242d$88 [21:44:41] Oh [21:44:49] Every single time someone screwed this up, I just live hacked addWiki to comment the done parts [21:44:51] and the broken part [21:44:57] k [21:44:57] Oh lol [21:44:57] you are doing $deity's work [21:45:01] I don't think I've ever had that script run successfully [21:45:01] Ever.
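The frustration above — addWiki.php dying halfway and forcing a live hack that comments out the already-completed parts — is essentially a missing checkpoint mechanism. A hedged sketch of one way such a script could record finished steps and skip them on re-run; this is an illustration of the idea, not how addWiki.php actually works, and the step names are invented:

```shell
# Sketch: checkpointed step runner. Completed step names are appended
# to a state file, so a re-run after a mid-script crash skips them
# instead of requiring manual commenting-out.
state=$(mktemp)

step() {
  name=$1; shift
  if grep -qx "$name" "$state"; then
    echo "skip $name (already done)"
    return 0
  fi
  "$@" && echo "$name" >> "$state"
}

step create_tables  true            # succeeds, recorded in the state file
step cirrus_index   false || true   # fails, so NOT recorded
step create_tables  true            # simulated re-run: skipped
```

On a real re-run only `cirrus_index` (and anything after it) would execute again, which is exactly what the live hack achieves by hand.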
[21:45:07] Without having to live hack it [21:45:07] :-) [21:45:24] if it helps, the labs part is completely broken too [21:45:33] (03CR) 10Andrew Bogott: [C: 032] typo fix re: spice package name [puppet] - 10https://gerrit.wikimedia.org/r/303072 (owner: 10Andrew Bogott) [21:45:38] It's always something different [21:46:28] It seems that it is https://phabricator.wikimedia.org/diffusion/ECIR/browse/master/includes/Maintenance/Maintenance.php;7d73c704af314dd27b1d5006fa52c7fe4a14242d$88 [21:47:12] Okay I've dereckson@terbium:/srv/mediawiki/php-1.28.0-wmf.13/extensions/WikimediaMaintenance$ sudo -u mwdeploy chmod 664 addWiki.php [21:47:19] and I'm commenting done parts [21:47:27] Yeah, that's what I'd do [21:48:17] let me know when you're done and I'll check [21:48:28] !log T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b', complete [21:48:30] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [21:48:30] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:29] Krenair: I'm done [21:49:59] (03PS8) 10Andrew Bogott: WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 [21:50:01] (03PS1) 10Andrew Bogott: I can't stop making typos [puppet] - 10https://gerrit.wikimedia.org/r/303076 [21:50:14] !log T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `d' [21:50:16] T140869: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869 [21:50:16] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:50:17] Dereckson, stop [21:50:17] problem [21:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [21:50:40] your commenting included $ucsite, but that's used later on [21:51:20] (I only salvaged the selectdb) [21:51:27] other than that LGTM [21:51:32] okay [21:52:39] I used to have to deal with multiple of these issues sometimes [21:52:49] Let's see how far populateSitesTable gets [21:52:56] Okay, addSite, try 2 [21:53:23] done. [21:53:25] (03CR) 10jenkins-bot: [V: 04-1] WIP: Aggregate instance root passwords on Labs puppet master [puppet] - 10https://gerrit.wikimedia.org/r/302834 (owner: 10Andrew Bogott) [21:53:27] Done. sync the config as in https://wikitech.wikimedia.org/wiki/Add_a_wiki#MediaWiki_configuration [21:54:00] Where could we check the mail? [21:54:50] On newprojects@lists.wikimedia.org [21:55:07] https://lists.wikimedia.org/pipermail/newprojects/2016-August/000106.html [21:55:11] okay mail worked [21:56:05] yep looks good [21:56:22] so the only thing missing is Cirrus? no complaints from populateSites etc.? [21:56:35] on mw1099, the config is already there, and I'm still redirected to Incubator [21:56:36] (03CR) 10Andrew Bogott: [C: 032] I can't stop making typos [puppet] - 10https://gerrit.wikimedia.org/r/303076 (owner: 10Andrew Bogott) [21:56:41] no complaint from populateSites [21:58:29] What is the next step now? [21:58:34] No error on mw1099 [21:58:35] Krenair: MassMessage cache clear is nbd, it expires every hour [21:59:02] Dereckson, investigating [21:59:19] we sync to prod or we solve Incubator redirect first? [21:59:35] incubator redirect is dblist related IIRC [21:59:51] Reedy: yes but dblists are fine on mw1099, terbium and tin [22:00:00] touch IS? 
[22:00:30] dereckson@mw1099:~$ sudo -u mwdeploy touch /srv/mediawiki/wmf-config/InitialiseSettings.php [22:00:44] Yes [22:00:47] It comes from wgLocalDatabases [22:00:51] which is based on wgConf->wikis [22:00:54] which comes from all.dblist [22:01:07] and indeed if you `mwrepl aawiki` and `=$wgLocalDatabases`, it's there [22:01:18] dereckson@mw1099:~$ grep tcywiki /srv/mediawiki/dblists/all.dblist [22:01:21] tcywiki [22:01:46] and yet: [22:01:48] krenair@mw1099:/srv/mediawiki$ curl -I -H "Host: tcy.wikipedia.org" http://mw1099/wiki/Main_Page?asd [22:01:48] HTTP/1.1 302 Found [22:02:19] Location: http://incubator.wikimedia.org/wiki/Wp/tcy/Main_Page [22:03:11] Usually, when do you sync to prod the change? [22:03:16] before or after test that? [22:03:37] Oh [22:03:45] Did you rebuild wikiversions? [22:03:49] nope [22:04:08] That'll explain it [22:04:09] krenair@mw1099:/srv/mediawiki$ grep tcywiki wikiversions.php [22:04:10] krenair@mw1099:/srv/mediawiki$ grep tcywiki wikiversions.json [22:04:10] "tcywiki": "php-1.28.0-wmf.13", [22:04:12] Run sync-wikiversions to synchronize the version number to use for this wiki [22:04:19] is after Merge the config change in Gerrit, and pull it onto tin [22:04:49] we can do that bit live too [22:04:51] Okay [22:04:59] just do it all live [22:05:00] much easier [22:05:01] :) [22:06:04] done [22:06:21] got it [22:06:50] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#2524307 (10JbuattiWMF) Hi @Dzahn , could we add access for the following two Wikitech accounts? skidd rgopal They need access to review the draft transparency report. Their... 
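As the two greps above show, the redirect persisted because tcywiki was present in all.dblist but absent from wikiversions.json, and both are needed before MediaWiki stops routing the wiki through missing.php. A small sketch of a pre-sync sanity check that would catch exactly that mismatch — the file contents are fabricated for the demo, and this helper is an assumption, not an existing tool:

```shell
# Sketch: flag wikis listed in all.dblist that lack a wikiversions
# entry -- the exact state tcywiki was in before sync-wikiversions ran.
d=$(mktemp -d)
printf 'aawiki\ntcywiki\n' > "$d/all.dblist"
printf '{\n    "aawiki": "php-1.28.0-wmf.13"\n}\n' > "$d/wikiversions.json"

missing=""
while read -r db; do
  grep -q "\"$db\"" "$d/wikiversions.json" || missing="$missing $db"
done < "$d/all.dblist"
echo "missing from wikiversions:$missing"
```

Running such a check before syncing would have surfaced the gap without the curl round-trip against mw1099.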
[22:06:53] Pff, someone else got UID 1 [22:07:03] !log mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki tcy wikipedia tcywiki [22:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:37] Reedy got merged before me [22:07:51] trolol [22:09:32] LOL i get https://incubator.wikimedia.org/wiki/Wp/tcy?goto=mainpage a blank page now [22:09:50] we're using X-Wikimedia-Debug with mw1099 [22:10:03] Oh [22:10:09] It works now [22:10:14] https://tcy.wikipedia.org/wiki/%E0%B2%AE%E0%B3%81%E0%B2%96%E0%B3%8D%E0%B2%AF_%E0%B2%AA%E0%B3%81%E0%B2%9F [22:10:21] Dereckson, I think we can roll it out without Cirrus, right? [22:10:22] yes I'm syncing [22:10:24] ok [22:10:45] You may want to set permissions on the main page before some spamming bot decides to give links away [22:11:10] paladox: ask this kind of stuff on #wikimedia-stewards perhaps [22:11:17] Oh [22:11:20] Yeah, nothing to worry about [22:11:28] Ok [22:11:34] When the task is updated, the imports and protections will happen pretty quickly [22:12:15] I'm running the cirrus part again [22:12:23] !log Synced dblists, wikiversions, langlist, wmf-config/InitialiseSettings.php for tcy.wikipedia [22:12:26] Where is the bot? [22:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:37] which bot? [22:12:40] morebots? [22:12:41] I am a logbot running on tools-exec-1204. [22:12:41] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:12:41] To log a message, type !log . [22:13:03] I made the 'running on x' bit kind of redundant, it's in the host now :/ [22:13:07] logmsgbot [22:14:27] logmsgbot, help [22:14:30] ohhhh [22:14:48] You remember when greg-g quieted neon? [22:14:53] Yeah that bot runs on neon.wikimedia.org [22:15:03] along with icinga-wm [22:15:04] oh [22:15:07] that explains that [22:15:08] I thought we could unquiet now [22:15:17] we probably could. greg-g?
[22:15:20] since they said 15 minutes to wait before we can make it talk [22:15:48] But only ops can do it [22:16:02] robh could you unquiet neon.wikimedia.org please [22:16:04] ? [22:16:12] !log Synchronized static/images/project-logos: Logos for tcy.wikipedia.org (T140898) (duration: 00m 48s) [22:16:13] T140898: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898 [22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:24] Thanks [22:16:28] welcome [22:16:32] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#2524369 (10JbuattiWMF) 05Resolved>03Open [22:16:33] :) [22:16:38] i should have set a timer, ehh [22:16:40] oh well [22:16:43] a new wikipedia? [22:16:43] Oh [22:16:46] Yep [22:16:48] tcy [22:16:51] https://tcy.wikipedia.org/wiki/%E0%B2%B5%E0%B2%BF%E0%B2%B8%E0%B3%87%E0%B2%B8%E0%B3%8A:Log [22:16:56] We start to have visitors [22:17:02] :) [22:17:19] :) [22:17:29] pfff [22:17:32] even the log thinks I should be UID 1 [22:18:15] :) [22:19:23] Dereckson, I wonder why Cirrus was unhappy [22:19:27] the script completed successfully for me [22:19:41] Probably many reasons [22:20:10] and well... search seems to work [22:21:01] okay logo works [22:21:01] running it standalone vs piggybacked from aawiki... [22:21:04] mobrovac: ping? [22:21:10] jynus, hey, so, it's ready [22:21:30] Reedy, yeah that might be it :( [22:21:51] Some config presumably needs bootstrapping [22:22:17] So [22:22:19] "As soon as tcy.wikipedia.org is up and kicking, yes. If that happens today, you can coordinate with Gabriel, Petr or Eric E. to restart RB." 
| user_id | user_name | user_registration | [22:22:26] | 1 | Reedy | 20160804220627 | [22:22:26] | 2 | Krenair | 20160804220629 | [22:22:28] meh [22:22:34] 2 seconds ftw [22:23:15] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:23:50] Dereckson: Krenair notifications page is broken on tcywiki [22:24:00] Says 20 at the top [22:24:11] "You have no notifications." on the special page [22:24:22] shows them on other wikis [22:24:38] Something with them being from before the wiki was created, and not propagated? [22:24:41] I don't see any [22:24:41] RoanKattouw: ping? ^ [22:24:49] Reedy: we're missing RestBase [22:25:00] Does Echo use Restbase? [22:25:01] There is https://gerrit.wikimedia.org/r/#/c/300214/ to merge now [22:25:03] No [22:25:05] no [22:25:17] RoanKattouw: we've started tcy., and we don't have notifications there according to Reedy [22:25:24] Will investigate in a second, hold on [22:25:27] any old ones [22:25:52] On the special page [22:25:58] petan: ping?
[22:26:19] (03PS2) 10Dereckson: restbase: add new tcy.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [22:26:35] (03PS3) 10GWicke: restbase: add new tcy.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [22:26:37] Exception in module-execute in module ext.echo.special: [22:26:37] load.php?debug=false&lang=en-gb&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=0wf6vc6:175 TypeError: Cannot read property 'preferences' of null TypeError: Cannot read property 'preferences' of null(…)log @ load.php?debug=false&lang=en-gb&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=0wf6vc6:175 [22:26:38] (03Draft1) 10Paladox: Gerrit: Support having phab commits as links [puppet] - 10https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) [22:26:47] (03PS9) 10Paladox: Gerrit: Support having phab commits as links [puppet] - 10https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) [22:26:51] Dereckson: the puppet change has not been merged yet [22:26:57] that would be the next step [22:27:35] gwicke: can you restart RB if we merge it? [22:27:46] opsens have the necessary powers, so typically the technique is to find one on IRC [22:27:49] Reedy: Oh and the 20 at the top is the global # of notifs, while the special page only shows local ones [22:27:49] we can't merge it [22:27:53] mutante: could you merge 300214? [22:27:57] You may have found a general bug though I think, checking [22:28:08] we can restart RB once it's merged [22:28:15] Krenair: by "we" I actually meant Daniel :p [22:28:46] Dereckson mutante is i think going to the airport [22:28:53] oh [22:28:56] aha [22:29:00] mutante, bblack, robh, YuviPanda: could one of you review / merge https://gerrit.wikimedia.org/r/#/c/300214/ ? 
[22:29:02] oh [22:29:05] Reedy: Yup, general bug, will file, thanks for reporting [22:29:10] np [22:29:52] gwicke: will this require a restbase restart, and will you be handling it? [22:30:01] otherwise seems pretty straightforward [22:30:20] robh: to actually apply it, RB will need to be restarted, yes [22:30:27] after puppet has updated the config everywhere [22:30:34] we'll handle that [22:31:20] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [22:31:42] it's still queued for unit testing, just waiting it out.... [22:32:06] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:32:14] (03CR) 10RobH: [C: 032] "chatted with gwicke, after this config is pushed to all restbase, services team will handle restarting the service." [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [22:32:30] once it's done with that, I can merge [22:32:43] robh: thanks! [22:33:04] glad to help [22:34:17] Thanks. [22:34:38] gwicke: ok, merged live on puppetmaster [22:35:27] okay, we'll wait for puppet & restart RB in ~30 minutes [22:35:32] I restored the genuine version of /srv/mediawiki/php-1.28.0-wmf.13/extensions/WikimediaMaintenance/addWiki.php [22:35:53] Dereckson: Krenair: Have we filed a bug for the cirrus part?
[22:35:58] not yet [22:37:36] 06Operations, 10Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2524425 (10GWicke) Related discussion on XKey & purging from today's meeting with @bblack, @Smalyshev, @ema, @mobrovac & myself: https://docs.google.com/document/d/1dIYQTSoJE2DC5aU7_pr4oE59f0QDJ5dlDYIYmwBOyDI/edit [22:39:05] (03PS10) 10Paladox: Gerrit: Support having phab commits as links [puppet] - 10https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) [22:40:03] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2524429 (10GWicke) Related discussion notes on XKey & purging from today's meeting with @bblack, @Smalyshev, @ema, @mobrovac & myself: https://docs.google.com/document/d/1dIYQTSoJE2DC5aU7_pr4oE59f0QDJ5dlDYIYmwBOy... [22:40:55] what's the problem with cirrus? [22:41:31] aude https://phabricator.wikimedia.org/P3645 [22:42:02] there's been a lot of refactoring in cirrus, so likely could be a bug [22:43:19] I'm not sure [22:43:29] I'm running the script; it seems to be okay [22:43:31] yes it succeeded [22:43:39] could be because it was only in mw1099 not in prod? [22:43:50] !log Run mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=tcywiki --baseName=tcywiki [22:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:59] Output at https://phabricator.wikimedia.org/P3646 [22:45:48] Krenair: when you started the script last time, did you do it after syncing the wmf-config folder? [22:46:41] that looks ok [22:46:47] Dereckson, last time?
[22:46:49] I think so [22:46:52] no wait [22:46:57] it would've been done with addWiki [22:47:23] Here, I merged the config change, pulled it onto tin, mw1099 and terbium, then ran addWiki [22:50:10] 06Operations, 10Citoid, 06Services: Package and test Zotero for Jessie - https://phabricator.wikimedia.org/T107302#2524493 (10greg) [22:51:44] 06Operations, 06Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#2524496 (10greg) [22:53:17] 06Operations, 06Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1206310 (10greg) [22:53:34] 06Operations, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, and 5 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2524526 (10greg) [22:55:09] * Dereckson files. [22:57:14] https://phabricator.wikimedia.org/T142153 [23:00:04] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160804T2300). [23:00:04] MaxSem and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:20] * MaxSem will do it [23:01:22] \o [23:02:41] 06Operations, 06Release-Engineering-Team, 15User-greg, 07Wikimedia-Incident: Institute quarterly(?) review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2524577 (10ksmith) I don't really have the context to know what is going to make sense, so I'll toss out some ideas. Assum...
[23:02:48] MaxSem: could you defer 5 minutes, I've a little thing to achieve (interwiki map) [23:02:53] sure [23:02:56] Thanks [23:04:21] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2524580 (10GWicke) [23:05:22] (03PS1) 10Dereckson: Interwiki update: tcy.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303098 (https://phabricator.wikimedia.org/T140898) [23:05:58] (03CR) 10Dereckson: [C: 032] Interwiki update: tcy.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303098 (https://phabricator.wikimedia.org/T140898) (owner: 10Dereckson) [23:06:27] (03Merged) 10jenkins-bot: Interwiki update: tcy.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303098 (https://phabricator.wikimedia.org/T140898) (owner: 10Dereckson) [23:06:36] grmbl [23:06:39] error: insufficient permission for adding an object to repository database .git/objects [23:07:32] mutante, ^ [23:07:33] robh: could you chown mwdeploy:wikidev /srv/mediawiki-staging/.git on Terbium please? [23:07:43] (recursively) [23:07:49] er on Tin [23:07:56] MaxSem: Can I add https://gerrit.wikimedia.org/r/#/c/303099/ too? (Will add to the wiki page as well) [23:08:15] MaxSem: mutante seems away [23:08:19] i can [23:08:27] Dereckson: do that for ya now [23:08:30] sure [23:08:56] robh: thanks [23:08:57] Dereckson: done [23:10:24] Okay, interwiki map works on mw1099 [23:12:11] !log dereckson@tin Synchronized wmf-config/interwiki.php: Interwiki map update for tcy.wikipedia.org ([[Gerrit:303098]], T140898) (duration: 01m 05s) [23:12:12] T140898: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898 [23:12:13] Thanks MaxSem I'm done. 
(interwiki works fine in prod too) [23:13:13] (03PS2) 10MaxSem: Remove temporary wgCentralAuthEnableUserMerge override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301746 [23:13:18] (03CR) 10MaxSem: [C: 032] Remove temporary wgCentralAuthEnableUserMerge override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301746 (owner: 10MaxSem) [23:13:45] (03Merged) 10jenkins-bot: Remove temporary wgCentralAuthEnableUserMerge override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301746 (owner: 10MaxSem) [23:14:58] !log mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php tcywiki --backend=local-multiwrite [23:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:49] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/301746/2 (duration: 00m 53s) [23:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:49] RoanKattouw, jdlrobson - your changes are live on mw1099, please test [23:22:33] Do we have a ticket about the git repository permissions issue? [23:22:38] OK, testing [23:23:11] aude: are you still here? [23:24:11] aude: could we remove https://wikitech.wikimedia.org/wiki/Add_a_wiki#Wikidata from the page as it's now called by WikimediaMaintenance/addWiki.php? [23:24:41] (well only the need to run it separately, not the debug instructions and warning) [23:25:08] MaxSem: Working [23:25:48] Krenair: https://phabricator.wikimedia.org/T127093 [23:26:27] MaxSem: works good [23:26:29] thanks! [23:26:48] Dereckson: did it populate correctly? [23:27:01] aude: if no output, it's correct, yes [23:28:39] hmm...
site_identifiers is populated but not the sites table [23:29:00] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on tin is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [23:29:09] greg-g, should be reopened [23:29:44] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/MobileFrontend: https://gerrit.wikimedia.org/r/#/c/302992/ (duration: 00m 56s) [23:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:53] Krenair: apparently :/ [23:30:02] jdlrobson, ^ [23:31:40] 06Operations, 10Deployment-Systems: error on tin:/srv/mediawiki-staging: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T127093#2524662 (10greg) 05Invalid>03Open This has been happening with more regularity lately :/ @Dzahn mentioned th... [23:34:49] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/MobileFrontend: https://gerrit.wikimedia.org/r/#/c/303095/ https://gerrit.wikimedia.org/r/#/c/303099/ (duration: 00m 52s) [23:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:11] RoanKattouw, ^ [23:35:39] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [23:37:07] !log restart restbase to apply gerrit:300214 config change [23:37:11] !log maxsem@tin Synchronized php-1.28.0-wmf.13/extensions/Echo: https://gerrit.wikimedia.org/r/#/c/303095/ https://gerrit.wikimedia.org/r/#/c/303099/ (duration: 00m 54s) [23:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:13] RoanKattouw, now for realz [23:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:04] Works, thanks MaxSem [23:48:17] I'm puzzled about https://gerrit.wikimedia.org/r/#/c/303108/: analytics/refinery repo didn't have a .gitreview file. 
[23:50:50] That's 10 people who use git push directly