[00:28:36] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3037_v6
[00:30:36] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[00:38:16] bblack: any opinion on the last two comments on https://gerrit.wikimedia.org/r/#/c/222079/ ?
[00:38:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[00:38:45] if you feel those are non-issues, I'll merge and backport, and then Commons can be made HTTPS-only after the next point release
[00:46:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:52:57] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3043_v6
[00:54:56] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[00:58:26] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3039_v6, cp3047_v6
[01:00:38] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[01:10:15] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:10:35] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2005_v6
[01:11:26] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 connecting: (unnamed) not-conn: cp4012_v6
[01:12:06] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10848 bytes in 0.147 second response time
[01:14:36] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[01:17:26] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:19:36] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[01:19:55] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 51.77 ms
[01:22:35] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3045_v6
[01:24:27] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn:
cp3047_v6 [01:24:35] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:25:05] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3039_v6 [01:25:16] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 19 connecting: (unnamed) not-conn: cp2009_v6, cp3015_v6, cp3016_v6, cp3018_v6, cp4011_v6 [01:27:16] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [01:29:06] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:30:36] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:38:08] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp3005_v6 [01:39:26] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [01:40:06] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [01:41:16] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:43:16] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 1.005 second response time [01:43:35] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [01:45:07] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2005_v6, cp2008_v6 [01:46:25] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:46:46] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2020_v6, cp3045_v6 [01:48:25] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10848 bytes in 1.135 second response time [01:51:15] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:56:45] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:59:26] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 24 connecting: (unnamed) [02:01:36] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [02:06:47] PROBLEM - IPsec on cp1074 is 
CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp3032_v6, cp3033_v6, cp3049_v6 [02:08:46] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3037_v6 [02:10:45] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:10:45] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:13:36] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3017_v6 [02:15:36] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [02:16:37] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3043_v6 [02:17:15] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp4005_v6, cp4014_v6 [02:17:15] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3048_v6 [02:18:38] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:19:16] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:20:37] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4007_v6 [02:21:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:22:37] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:27:06] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2002_v6, cp3034_v6 [02:28:35] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp2002_v6, cp3049_v6, cp4007_v6 [02:29:06] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:29:24] !log l10nupdate@tin Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 06m 42s) [02:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:55] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 51 not-conn: cp2010_v6, cp2013_v6, cp2023_v6, cp3007_v6, cp3009_v6, cp3041_v6, cp4018_v6 [02:32:25] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf20) at 2015-08-31 02:32:25+00:00 [02:32:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:46] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:32:55] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK [02:35:05] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (101096s 100000s) [02:47:06] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4007_v6 [02:53:05] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:59:17] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3016_v6, cp4019_v6 [03:03:27] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [03:25:06] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 connecting: (unnamed) not-conn: cp3018_v6 [03:26:16] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3042_v6 [03:27:06] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [03:28:17] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [03:42:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [03:43:51] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1588351 (10BBlack) The approval links a decision from 2011? https://phabricator.wikimedia.org/T26928#280223 What brought this up now, and is that approval still valid? Have they forgo... [03:56:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:05:13] !log disabled ipv6 autoconf on neon, flushed old dynamic addr [04:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:05:54] bblack: I'm around, let me know if something's up and I can help. 
[04:07:03] just the usual mobile_v6 icinga alerts earlier, and a spate of ipv6-only ipsec failures more recently
[04:07:05] operations, Wikimedia-Site-Requests, Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1588353 (Krenair) >>! In T31919#1588351, @BBlack wrote: > The approval links a decision from 2011? https://phabricator.wikimedia.org/T26928#280223 Yes... >>! In T31919#1588351, @BBl...
[04:07:20] we don't really use v6 for cache<->cache traffic anyways. somehow it's all inter-related probably, though.
[04:10:20] root@cp1046:~# ping6 cp4011.ulsfo.wmnet
[04:10:20] connect: Network is unreachable
[04:10:20] root@cp1046:~# ping6 cp4011.ulsfo.wmnet
[04:10:20] PING cp4011.ulsfo.wmnet(cp4011.ulsfo.wmnet) 56 data bytes
[04:10:21] 64 bytes from cp4011.ulsfo.wmnet: icmp_seq=1 ttl=61 time=73.4 ms
[04:10:25] ....
[04:10:48] finally randomly caught an error on my own. worked again immediately after
[04:11:20] this must be some bug related to autoconf and such, I think. or some other v6 bug on our routers, perhaps.
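Editorial sketch (not part of the channel traffic): one plausible reading of the ping6 failure above is that the host briefly had no IPv6 default route, since "Network is unreachable" is what the kernel reports when no route matches. A rough way to watch for that is to parse the remaining lifetime of each RA-learned default route. The sample text below assumes the usual iproute2 layout of `ip -6 route show default` (an `expires Nsec` field); the interface name, metric, router addresses, and the helper names `parse_default_routes` and `expiring` are illustrative assumptions, not output or tooling taken from these hosts.

```python
import re

# Assumed iproute2-style output of `ip -6 route show default`.
# Interface, metric and addresses are illustrative, not captured from cp1046.
SAMPLE = """\
default via fe80::1 dev eth0 proto ra metric 1024 expires 9sec hoplimit 64
default via fe80::fe00:0:0:2 dev eth0 proto ra metric 1024 expires 9sec hoplimit 64
default via fe80::fe00:0:0:1 dev eth0 proto ra metric 1024 expires 0sec hoplimit 64
"""

# Capture the next-hop router and the seconds left before the route expires.
ROUTE_RE = re.compile(r"^default via (\S+) .*\bexpires (\d+)sec\b")


def parse_default_routes(text):
    """Return a list of (router, seconds_left) for routes that carry an expiry."""
    routes = []
    for line in text.splitlines():
        match = ROUTE_RE.match(line)
        if match:
            routes.append((match.group(1), int(match.group(2))))
    return routes


def expiring(routes, threshold=2):
    """Return routers whose remaining lifetime is at or below `threshold` seconds."""
    return [router for router, left in routes if left <= threshold]


if __name__ == "__main__":
    routes = parse_default_routes(SAMPLE)
    for router, left in routes:
        print(router, f"{left}sec")
    print("about to expire:", expiring(routes))
```

On a real host the same functions could be fed the live command output in a loop; if every default route reaches zero at once, a momentary unreachable error would be the expected symptom, though that remains a hypothesis here rather than a diagnosis.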
[04:15:01] also, looping on some awk of ip6 default router outputs from RA:
[04:15:02] fe80::1 9sec
[04:15:02] fe80::fe00:0:0:2 9sec
[04:15:03] fe80::fe00:0:0:1 0sec
[04:15:11] ^ caught one that made it all the way to zero, for one of the two routers
[04:16:00] hmmmm
[04:16:07] and just now saw another sample with all three of them at 0s or 1s at the same time
[04:18:44] so either that's directly the general v6 issue we're having (RA-based default routes dropping off due to some issue with their lifetime + interval)
[04:18:58] or we just have some other v6 loss issue which also causes loss of RAs, perhaps
[04:31:36] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (25927 100000s)
[04:34:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 31 04:34:14 UTC 2015 (duration 34m 13s)
[04:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:40:05] I still don't know why it's happening, but it's the first real clue I've seen in a while at least
[04:40:10] https://phabricator.wikimedia.org/P1952
[05:14:21] (PS1) Glaisher: Enable WikidataPageBanner extension on Russian Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/234942 (https://phabricator.wikimedia.org/T110837)
[05:34:22] (PS1) Glaisher: Clean up WikidataPageBanner related config [mediawiki-config] - https://gerrit.wikimedia.org/r/234944
[05:43:10] (CR) Nikerabbit: "Are we blocking this patch because one developer does not like the syntax?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [05:48:56] 6operations, 10MediaWiki-extensions-PdfHandler, 6Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007#1588409 (10Thgoiter) I still see 10 files without thumbnail in http://commons.wikimedia.org/wiki/Catego... [06:31:36] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:06] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:16] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:05] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:36] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:16] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:55:45] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:56:15] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:36] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 
0 failures [06:57:55] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:58:15] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:25] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:25] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:24:45] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [100000000.0] [07:35:00] (03PS4) 10Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) [07:39:44] (03CR) 10Muehlenhoff: "Actually, this patch is identical with the PS1 version of" [puppet] - 10https://gerrit.wikimedia.org/r/234578 (owner: 10Muehlenhoff) [07:39:57] (03PS2) 10Muehlenhoff: Query the puppetmaster from hiera instead of $::serverip [puppet] - 10https://gerrit.wikimedia.org/r/234578 [07:40:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Query the puppetmaster from hiera instead of $::serverip [puppet] - 10https://gerrit.wikimedia.org/r/234578 (owner: 10Muehlenhoff) [07:50:55] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [07:58:41] good morning [08:04:32] 6operations, 10Deployment-Systems, 6Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1588501 (10hashar) Seems a root has to arm it with: ``` sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa ``` [08:04:57] mwdeployement are currently blocked because the key holder has some 
issue. https://phabricator.wikimedia.org/T110794#1588501
[08:05:05] seems it is all about running some ssh-add command on tin
[08:07:35] hashar: it keeps crashing because of a recent change i made, fixing
[08:07:57] ori: also noticed the key holder on beta cluster crashed at 7pm utc or so
[08:08:01] fixed by running ssh-add
[08:09:05] ori: don't spend too much time on it though! it is 1am already :-}
[08:09:14] Pffft
[08:09:18] i can't leave it broken
[08:09:21] Sleep is for the people who aren't ori
[08:09:23] i have to fix or revert, and i'd rather fix
[08:10:37] (PS1) Muehlenhoff: Enable ferm on rhodium [puppet] - https://gerrit.wikimedia.org/r/234954
[08:11:31] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on rhodium [puppet] - https://gerrit.wikimedia.org/r/234954 (owner: Muehlenhoff)
[08:11:56] oh, I think I know what it is
[08:15:16] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys.
[08:16:48] operations, Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1588537 (jcrespo) If that is true, it should be documented at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#In_your_own_repo_via_gerrit and https...
[08:18:52] (PS1) Ori.livneh: ssh-agent-proxy: break out of select loop once client is done [puppet] - https://gerrit.wikimedia.org/r/234955 (https://phabricator.wikimedia.org/T110794)
[08:19:32] (CR) Ori.livneh: [C: 2] ssh-agent-proxy: break out of select loop once client is done [puppet] - https://gerrit.wikimedia.org/r/234955 (https://phabricator.wikimedia.org/T110794) (owner: Ori.livneh)
[08:23:18] operations, ops-eqiad, Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1588561 (ori)
[08:23:21] operations, Deployment-Systems, Release-Engineering, Patch-For-Review: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1588558 (ori) Open>Resolved a:ori @jcrespo, I'll update the docs.
[08:23:28] hashar: unblocked
[08:23:41] ori: great thank you :-}
[08:23:50] oh you even closed the task!
[08:23:52] operations, MediaWiki-extensions-PdfHandler, Multimedia: Error creating PDF on Commons: "convert: no decode delegate for this image format" (fixed in GS 9.07) - https://phabricator.wikimedia.org/T50007#1588562 (Aklapper) >>! In T50007#1588409, @Thgoiter wrote: > Is this another bug? Yes - see the erro...
[08:24:14] thank you [08:24:59] (03PS1) 10Muehlenhoff: Enable ferm on strontium (and merge it to the role) [puppet] - 10https://gerrit.wikimedia.org/r/234956 [08:25:22] hopefully in a few months I will not have to deploy code to change servers' topology [08:28:33] (03PS2) 10Muehlenhoff: Enable ferm on strontium (and merge it to the role) [puppet] - 10https://gerrit.wikimedia.org/r/234956 [08:28:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on strontium (and merge it to the role) [puppet] - 10https://gerrit.wikimedia.org/r/234956 (owner: 10Muehlenhoff) [08:33:13] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1028, return ES servers back from maintenance (duration: 00m 12s) [08:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:35:19] (03PS1) 10Muehlenhoff: Remove stray lines (nop commit to test puppet-merge after enabling ferm on puppetmaster backends) [puppet] - 10https://gerrit.wikimedia.org/r/234957 [08:35:46] (03PS2) 10Muehlenhoff: Remove stray lines (nop commit to test puppet-merge after enabling ferm on puppetmaster backends) [puppet] - 10https://gerrit.wikimedia.org/r/234957 [08:35:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove stray lines (nop commit to test puppet-merge after enabling ferm on puppetmaster backends) [puppet] - 10https://gerrit.wikimedia.org/r/234957 (owner: 10Muehlenhoff) [08:39:01] 6operations, 6Labs: labstore1002 not mounting all LVs after reboot - https://phabricator.wikimedia.org/T110832#1588588 (10fgiunchedi) actionables: * `start-nfs` doesn't seem to have launched or checked `sync-exports` so bindmounts weren't present when nfs was first started * I couldn't find an equivalent `stop... [08:44:32] jynus: https://wikitech.wikimedia.org/wiki/Keyholder [08:45:00] ori, it is ok, go to sleep now! [08:45:02] :-) [08:45:15] yep, done [08:45:16] good night [08:53:54] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. 
Run keyholder arm to arm it. [09:06:19] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 25.93% of data above the critical threshold [100000000.0] [09:08:31] (03PS2) 10Muehlenhoff: Add debdeploy master to palladium [puppet] - 10https://gerrit.wikimedia.org/r/234495 [09:09:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add debdeploy master to palladium [puppet] - 10https://gerrit.wikimedia.org/r/234495 (owner: 10Muehlenhoff) [09:14:45] (03PS1) 10Jcrespo: Depool es1007 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234965 (https://phabricator.wikimedia.org/T105843) [09:23:40] (03CR) 10Jcrespo: [C: 032] Depool es1007 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234965 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [09:25:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1007 for maintenance (duration: 00m 13s) [09:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:29] !log update graphite retention policy on files with previous retention and older than 60d T96662 [09:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:08] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [09:35:36] !log depool ms-fe1001 in preparation for ferm changes [09:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:41:35] (03PS1) 10Filippo Giunchedi: swift: enable firewall for ms-fe1 [puppet] - 10https://gerrit.wikimedia.org/r/234967 [09:43:40] (03CR) 10Muehlenhoff: [C: 031] swift: enable firewall for ms-fe1 [puppet] - 10https://gerrit.wikimedia.org/r/234967 (owner: 10Filippo Giunchedi) [09:44:21] (03PS2) 10Filippo Giunchedi: swift: enable firewall for ms-fe in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/234967 [09:45:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: enable firewall for 
ms-fe in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/234967 (owner: 10Filippo Giunchedi) [09:51:34] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1588735 (10ArielGlenn) 3NEW a:3ArielGlenn [09:51:38] !log repool ms-fe1001 [09:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:19] (03PS1) 10Ori.livneh: Get rid of cargo-cult statistics in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/234969 [09:55:01] !log cloning es1007 mysql data into es1013 (ETA: 5h30m) [09:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:27] 6operations, 10Deployment-Systems, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1588760 (10mark) a:3ArielGlenn [10:01:10] 6operations, 10Deployment-Systems, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1588768 (10ArielGlenn) I chatted with akosiaris about this, I had not been able to get anything useful after digging around but I will have another go at it this week. [10:05:48] !log depool ms-fe1002 to apply firewall changes [10:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:59] (03PS1) 10ArielGlenn: dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 [10:08:04] 6operations: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1588794 (10ArielGlenn) Draft of admin script: https://gerrit.wikimedia.org/r/234971 Still needs testing and debugging for most options. To be written: script that will restart a broken run from whever it has been inter... 
[10:16:25] (03CR) 10Filippo Giunchedi: [C: 04-1] Get rid of cargo-cult statistics in check_graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234969 (owner: 10Ori.livneh) [10:18:16] !log repool ms-fe1002 and depool ms-fe1003 for firewall changes [10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:52] !log update graphite retention policy on files with previous retention and older than 30d T96662 [10:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:20:42] 6operations: mw2187 - read-only filesystem - https://phabricator.wikimedia.org/T109717#1588813 (10akosiaris) 5Open>3Invalid a:3akosiaris I see no objects, closing as `Invalid` [10:32:32] !log repool ms-fe1003 and depool ms-fe1004 for firewall changes [10:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:28] !log repool ms-fe1004 [10:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:43:50] (03CR) 10Yuvipanda: "Should have an extra parameter to check_graphite that allows us to actually specify a message - *that* would be the actual useful thing. 
S" [puppet] - 10https://gerrit.wikimedia.org/r/234969 (owner: 10Ori.livneh) [10:45:11] !log reenabling asw2-a5-eqiad:xe-0/0/36 (T107635) [10:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:26] 6operations, 10ops-eqiad, 7network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1588859 (10faidon) I turned it up, but it seems there is no link on it now: ``` xe-0/0/36 up down Core: << asw-a-eqiad:xe-6/1/0 {#2169} ``` [10:49:40] PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [100000000.0] [10:53:11] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1588875 (10jrobell) Thanks everyone for looking into this. I am still having issues when changing... [10:55:41] 6operations, 5Patch-For-Review: Ferm rules for swift - https://phabricator.wikimedia.org/T104965#1588877 (10MoritzMuehlenhoff) 5Open>3Resolved All swift systems are now ferm-enabled. [10:56:58] (03PS1) 10ArielGlenn: dump lists for rsync: copy partial or in progress dump if most recent [puppet] - 10https://gerrit.wikimedia.org/r/234973 [10:57:08] (03PS1) 10Muehlenhoff: Move base::firewall include into the role definitions [puppet] - 10https://gerrit.wikimedia.org/r/234974 [10:57:19] I'm planning to restart the saltmaster on palladium in 10 minutes (to pick up the new debdeploy module), it will take the minions a few minutes to re-connect. please speak out if that disturbs anyone's plans [10:57:40] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1588884 (10ArielGlenn) changes to list-last-n-good-dumps coming up, yet to be tested. 
see https://gerrit.wikimedia.org/r/234973 [11:10:50] !log restart salt-master on palladium [11:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:20:30] RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0] [11:32:50] (03Abandoned) 10Muehlenhoff: Enable ferm for puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/228784 (owner: 10Muehlenhoff) [11:33:32] (03Abandoned) 10Muehlenhoff: Enable base::firewall on eqiad proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/234276 (owner: 10Muehlenhoff) [11:48:32] (03CR) 10Sumit: [C: 031] Enable WikidataPageBanner extension on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234942 (https://phabricator.wikimedia.org/T110837) (owner: 10Glaisher) [11:56:41] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [12:01:02] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [12:03:34] (03PS2) 10Nemo bis: [Italian Planet] Update Wikimedia Italia feeds [puppet] - 10https://gerrit.wikimedia.org/r/234921 [12:05:03] apergos: do you know what the labstore1003 alerts are? [12:05:17] are they just rsyncs from dumps? [12:05:57] (03CR) 10Muehlenhoff: "rsyncd is covered by the ferm::service in misc::udp2log::rsyncd, udp2log is covered by the ferm rules in misc::udp2log::firewall." [puppet] - 10https://gerrit.wikimedia.org/r/227720 (owner: 10Muehlenhoff) [12:06:14] (03PS2) 10ArielGlenn: dump lists for rsync: copy partial or in progress dump if most recent [puppet] - 10https://gerrit.wikimedia.org/r/234973 [12:06:37] YuviPanda: no idea. I haven't changed anything over there (yet) [12:06:56] apergos: can you take a look maybe? 
we might have to adjust the threshold if it's just the rsync [12:07:08] sure [12:07:15] we can always set a bw limit [12:07:25] cool [12:14:10] whatever is causing the larts right now, it's not a dataset rsync [12:14:15] *alerts [12:14:53] (03PS1) 10Muehlenhoff: Use the DNS name [puppet] - 10https://gerrit.wikimedia.org/r/234978 [12:15:43] (03PS1) 10Dereckson: Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) [12:20:49] 6operations, 6Labs, 10Tool-Labs: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1588993 (10ArielGlenn) 3NEW [12:23:01] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:23:02] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [12:23:49] (03CR) 10ArielGlenn: [C: 032] dump lists for rsync: copy partial or in progress dump if most recent [puppet] - 10https://gerrit.wikimedia.org/r/234973 (owner: 10ArielGlenn) [12:29:43] (03PS1) 10ArielGlenn: dumps rsync to labs: generate and use list of lastr 3 good dumps [puppet] - 10https://gerrit.wikimedia.org/r/234982 [12:30:36] (03CR) 10ArielGlenn: [C: 032] dumps rsync to labs: generate and use list of lastr 3 good dumps [puppet] - 10https://gerrit.wikimedia.org/r/234982 (owner: 10ArielGlenn) [12:34:10] 6operations: copy partial dumps from dataset host to labs - https://phabricator.wikimedia.org/T108077#1589018 (10ArielGlenn) tested and merged. https://gerrit.wikimedia.org/r/#/c/234982/ is the change to generate the list of last three good dumps and use that for rsync, also merged. we should see new behavior t... 
[12:45:33] (PS3) Muehlenhoff: Enable debdeploy for an initial set of servers [puppet] - https://gerrit.wikimedia.org/r/234499
[12:49:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:51:03] hm, wonder if that's me
[12:57:52] (CR) Ottomata: "One more jmxtrans-jmx! Otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) (owner: Muehlenhoff)
[13:11:52] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 4 others: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1589060 (Andrew)
[13:11:54] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107, and 5 others: Try to fail over to labnet1002 - https://phabricator.wikimedia.org/T109329#1589058 (Andrew) Open>Resolved Done, and lessons learned now documented here: https://wikitech.wikimedia.org/wiki/Labs_troubleshooting#Fail-over
[13:17:57] (PS5) Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597)
[13:19:33] operations, IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099#1589091 (BBlack) We could also just not do step 1 immediately. It's the hardest step, and by skipping it all we lose is static-mapped ipv6 (or really, ipv6 at all) for new installs, b...
[13:22:27] apergos: almost certainly
[13:22:42] fixed
[13:24:11] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[13:28:21] 6operations: redo dumps monitor so it runs as a service - https://phabricator.wikimedia.org/T110888#1589112 (10ArielGlenn) 3NEW a:3ArielGlenn [13:28:46] (03PS6) 10Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) [13:28:47] 6operations: redo dumps monitor so it runs as a service - https://phabricator.wikimedia.org/T110888#1589123 (10ArielGlenn) [13:28:48] 6operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1589122 (10ArielGlenn) [13:30:51] 6operations, 7Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1589124 (10ArielGlenn) [13:31:10] ottomata: there's a varnishncsa process running as user "otto" on cp1052 since May? [13:31:17] (03PS7) 10Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) [13:31:19] otto 7627 1 2 May20 ? 2-18:19:23 varnishncsa -F %m %s -n frontend [13:31:32] ? [13:31:33] uh oh [13:31:34] looking [13:34:03] hm, bblack, sorry. I was doing some testing of the new non udp2log based varnishreqstats there, looks like I didn't clean up well. that project was just put on hold because of higher priority things. 
:/ [13:34:05] just killed it [13:34:09] ok thanks [13:34:09] and some screens I had running there [13:35:03] (03PS8) 10Ottomata: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:40:21] RECOVERY - DPKG on mc2001 is OK: All packages OK [13:40:30] RECOVERY - Disk space on mc2001 is OK: DISK OK [13:40:42] RECOVERY - RAID on mc2001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [13:40:42] RECOVERY - Memcached on mc2001 is OK: TCP OK - 0.053 second response time on port 11211 [13:40:43] !log restarting hhvm on mw1065 [13:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:51] RECOVERY - Redis on mc2001 is OK: TCP OK - 0.052 second response time on port 6379 [13:40:53] !log running puppet on newly-installed mc2001 [13:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:20] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 70645 bytes in 3.986 second response time [13:41:21] RECOVERY - dhclient process on mc2001 is OK: PROCS OK: 0 processes with command name dhclient [13:41:30] RECOVERY - configured eth on mc2001 is OK: OK - interfaces up [13:41:50] RECOVERY - salt-minion processes on mc2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:41:50] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:42:21] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [13:46:24] (03CR) 10Ottomata: [C: 031] Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [13:46:54] (03CR) 10Ottomata: [C: 031] Enable debdeploy for an initial set of servers [puppet] - 
10https://gerrit.wikimedia.org/r/234499 (owner: 10Muehlenhoff) [13:47:20] (03PS1) 10BBlack: varnishncsa: fix process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/234995 [13:48:08] (03CR) 10Ottomata: [C: 031] varnishncsa: fix process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/234995 (owner: 10BBlack) [13:48:54] (03CR) 10BBlack: [C: 032] varnishncsa: fix process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/234995 (owner: 10BBlack) [13:51:15] (03PS1) 10Andrew Bogott: Move labvir1001 and 1002 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234996 (https://phabricator.wikimedia.org/T110886) [13:51:17] (03PS1) 10Andrew Bogott: Move labvirt1003 and 1006 to Juno [puppet] - 10https://gerrit.wikimedia.org/r/234997 (https://phabricator.wikimedia.org/T110886) [13:51:19] (03PS1) 10Andrew Bogott: Make OpenStack Juno the new default, except for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/234998 (https://phabricator.wikimedia.org/T110886) [13:51:21] (03PS1) 10Andrew Bogott: Now that all virt nodes are running Juno, return everything to the scheduler pool. [puppet] - 10https://gerrit.wikimedia.org/r/234999 (https://phabricator.wikimedia.org/T110886) [13:57:24] (03PS9) 10Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) [13:57:57] (03CR) 10Filippo Giunchedi: [C: 031] Enable debdeploy for an initial set of servers [puppet] - 10https://gerrit.wikimedia.org/r/234499 (owner: 10Muehlenhoff) [13:58:11] RECOVERY - NTP on mc2001 is OK: NTP OK: Offset -0.03405439854 secs [13:59:25] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [14:05:06] (03PS1) 10Ottomata: require_package openjdk-7 in zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/235000 [14:06:35] !log rebooted krypton. 
was reporting 100% cpu steal time [14:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:05] PROBLEM - ganeti-noded running on ganeti1002 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 0 (root), command name ganeti-noded [14:09:58] (03PS2) 10Filippo Giunchedi: Move base::firewall include into the role definitions [puppet] - 10https://gerrit.wikimedia.org/r/234974 (owner: 10Muehlenhoff) [14:10:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Move base::firewall include into the role definitions [puppet] - 10https://gerrit.wikimedia.org/r/234974 (owner: 10Muehlenhoff) [14:10:22] (03PS2) 10Ottomata: require_package openjdk-7 in zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/235000 [14:10:29] (03CR) 10Ottomata: [C: 032 V: 032] require_package openjdk-7 in zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/235000 (owner: 10Ottomata) [14:11:55] (03PS1) 10Muehlenhoff: Enable ferm on analytics1021 [puppet] - 10https://gerrit.wikimedia.org/r/235001 [14:17:26] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1002 is OK: OK - create-dbusers is active [14:17:47] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1589214 (10zeljkofilipin) ``` $ git submodule status -76a5e8627659197f81a4e5240c3eb4fe01cb9888 modules/cdh -e4c66b04b3f4df82569f4d14d0cf74b6fb79e57d mod... [14:21:29] (03PS1) 10Matthias Mullie: Whitelist Flow opt-in on user talkpage as BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235003 (https://phabricator.wikimedia.org/T98270) [14:22:10] (03CR) 10Matthias Mullie: [C: 04-1] "Not to be enabled yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/235003 (https://phabricator.wikimedia.org/T98270) (owner: 10Matthias Mullie) [14:26:56] 6operations, 7Monitoring: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1589251 (10BBlack) 3NEW [14:29:39] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1589266 (10hashar) >>! In T102020#1589214, @zeljkofilipin wrote: > ``` > $ git submodule status > -76a5e8627659197f81a4e5240c3eb4fe01cb9888 modules/cdh... [14:34:11] RECOVERY - ganeti-noded running on ganeti1002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [14:34:49] (03PS1) 10Andrew Bogott: Make ldap/nscd/nslcd install order more explicit. [puppet] - 10https://gerrit.wikimedia.org/r/235005 (https://phabricator.wikimedia.org/T110891) [14:35:45] (03PS2) 10Andrew Bogott: Make ldap/nscd/nslcd install order more explicit. [puppet] - 10https://gerrit.wikimedia.org/r/235005 (https://phabricator.wikimedia.org/T110891) [14:38:05] (03CR) 10Andrew Bogott: [C: 032] Make ldap/nscd/nslcd install order more explicit. [puppet] - 10https://gerrit.wikimedia.org/r/235005 (https://phabricator.wikimedia.org/T110891) (owner: 10Andrew Bogott) [14:41:21] (03PS1) 10Andrew Bogott: Explicitly include nslcd package. [puppet] - 10https://gerrit.wikimedia.org/r/235006 (https://phabricator.wikimedia.org/T110891) [14:41:54] !log bouncing Cassandra on restbase1001 to apply temporary GC setting [14:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:01] (03PS2) 10Andrew Bogott: Explicitly include nslcd package. [puppet] - 10https://gerrit.wikimedia.org/r/235006 (https://phabricator.wikimedia.org/T110891) [14:43:18] (03CR) 10Andrew Bogott: [C: 032] Explicitly include nslcd package. 
[puppet] - 10https://gerrit.wikimedia.org/r/235006 (https://phabricator.wikimedia.org/T110891) (owner: 10Andrew Bogott) [14:43:31] !log elasticsearch cluster.routing.allocation.disk.watermark.high set to 75% to force elastic1022 to reduce its disk usage [14:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:04] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1589314 (10fgiunchedi) this seems to be happening when graphite (or statsd) show `null` for recent values, `check_graphite` by default looks ba... [14:49:56] (03PS1) 10John F. Lewis: icinga: puppetise apache mods [puppet] - 10https://gerrit.wikimedia.org/r/235008 (https://phabricator.wikimedia.org/T110893) [14:52:23] (03PS1) 10Faidon Liambotis: network: indent/reflow $special_hosts [puppet] - 10https://gerrit.wikimedia.org/r/235009 [14:52:24] (03PS1) 10Faidon Liambotis: network: add bast2001 to bastions [puppet] - 10https://gerrit.wikimedia.org/r/235010 [14:53:15] (03CR) 10Faidon Liambotis: [C: 032] network: indent/reflow $special_hosts [puppet] - 10https://gerrit.wikimedia.org/r/235009 (owner: 10Faidon Liambotis) [14:53:27] (03CR) 10Faidon Liambotis: [C: 032] network: add bast2001 to bastions [puppet] - 10https://gerrit.wikimedia.org/r/235010 (owner: 10Faidon Liambotis) [14:57:25] (03PS1) 10Andrew Bogott: Add more nslcd notifying. [puppet] - 10https://gerrit.wikimedia.org/r/235011 (https://phabricator.wikimedia.org/T110891) [14:57:44] andrewbogott: no trailing dot on commit message headers :) [14:57:54] (03PS4) 10Muehlenhoff: Enable debdeploy for an initial set of servers [puppet] - 10https://gerrit.wikimedia.org/r/234499 [14:58:02] paravoid: …ok, does that break something? 
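Note on the 14:43:31 !log entry: the log does not say how the watermark was changed. A minimal sketch, assuming it went through Elasticsearch's cluster settings endpoint (`PUT /_cluster/settings`) as a transient setting; the helper name `watermark_settings_body` is hypothetical and only builds the request body, it does not contact any cluster.

```python
import json


def watermark_settings_body(high="75%", persistent=False):
    """Build a request body for PUT /_cluster/settings.

    Assumption: the change was applied as a transient setting; pass
    persistent=True to build the persistent variant instead.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.routing.allocation.disk.watermark.high": high}}


body = watermark_settings_body("75%")
print(json.dumps(body, indent=2))
```

Lowering the high watermark makes the allocator move shards away from any node whose disk usage is above it, which matches the stated goal of getting elastic1022 to reduce its disk usage.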
[14:58:08] operations, Labs, Labs-Other-Projects: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1589345 (scfc) [14:58:28] (PS2) Andrew Bogott: Add more nslcd notifying [puppet] - https://gerrit.wikimedia.org/r/235011 (https://phabricator.wikimedia.org/T110891) [14:59:00] (PS1) Eevans: set JVM_OPTS entirely [puppet] - https://gerrit.wikimedia.org/r/235012 [14:59:28] (CR) Andrew Bogott: [C: 2] Add more nslcd notifying [puppet] - https://gerrit.wikimedia.org/r/235011 (https://phabricator.wikimedia.org/T110891) (owner: Andrew Bogott) [14:59:59] andrewbogott: no, but https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [15:00:00] (CR) Muehlenhoff: "(PS4 was just a manual rebase since gerrit choked on the move of the base::firewall include in the swift roles)" [puppet] - https://gerrit.wikimedia.org/r/234499 (owner: Muehlenhoff) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150831T1500). Please do the needful. [15:00:04] MatmaRex Krenair kart_ aude: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:25] operations, Labs, Labs-Other-Projects: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1589354 (scfc) (The host `dumps-3.dumps.eqiad.wmflabs` is part of the [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps|Dumps project]] and not related to #Tool... [15:00:28] I'm here. 
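Note on the "no trailing dot on commit message headers" exchange: a minimal sketch of how that single rule could be checked before pushing. Both function names are hypothetical, and this covers only the rule mentioned in the channel, not the rest of the linked guidelines.

```python
def has_trailing_dot(commit_message):
    """Return True if the subject (first line) of the message ends with a period."""
    lines = commit_message.splitlines()
    subject = lines[0].rstrip() if lines else ""
    return subject.endswith(".")


def strip_trailing_dot(commit_message):
    """Return the message with any trailing period(s) removed from the subject line."""
    lines = commit_message.splitlines()
    if not lines:
        return commit_message
    lines[0] = lines[0].rstrip().rstrip(".")
    return "\n".join(lines)


before = "Add more nslcd notifying.\n\nBug: T110891"
after = strip_trailing_dot(before)
print(has_trailing_dot(before), has_trailing_dot(after))
```

The sample subject is the one from the patch sets above: PS1 carried the trailing dot and PS2 dropped it.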
[15:00:35] * aude waves [15:01:18] (PS5) Muehlenhoff: Enable debdeploy for an initial set of servers [puppet] - https://gerrit.wikimedia.org/r/234499 [15:01:28] (CR) Muehlenhoff: [C: 2 V: 2] Enable debdeploy for an initial set of servers [puppet] - https://gerrit.wikimedia.org/r/234499 (owner: Muehlenhoff) [15:01:29] okie doke. I can SWAT. [15:02:05] hi. [15:02:07] thcipriani: :) [15:02:48] could my stuff go first? i'll need to leave soon [15:03:51] MatmaRex: sorry, just saw your comment, Let me get wikibase (since I started there) done and then I'll push yours out. [15:04:02] right, cool. thanks [15:04:15] thcipriani: mine is a submodule bump, do you want me to submit a patch for it? [15:04:25] (if yes, please +2 https://gerrit.wikimedia.org/r/#/c/234553/ :) ) [15:04:45] MatmaRex: shouldn't be necessary anymore, thanks for checking though :) [15:05:12] might take some time for jenkins :/ [15:06:12] aude: sure enough, ok, lemme just get MatmaRex's change out the door, hopefully be quick. [15:06:27] ok [15:07:13] MatmaRex, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_submodule [15:07:45] oh. neat [15:08:07] PROBLEM - puppet last run on neptunium is CRITICAL: CRITICAL: puppet fail [15:10:09] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1589373 (greg) >>! In T102020#1589266, @hashar wrote: > Maybe the job should be made to not process submodules? Logically that makes sense, and shoul... [15:10:56] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [15:12:14] thcipriani: eh, gate-and-submit jobs are not simultaneous, so UploadWizard's quick tests are now waiting for the Wikidata change to get merged [15:12:23] indeed. [15:12:28] thcipriani: i've gotta disappear, marktraceur should be able to substitute for me [15:12:34] darn zuul being so smart! 
[15:13:13] (03CR) 10Dzahn: [C: 031] icinga: puppetise apache mods [puppet] - 10https://gerrit.wikimedia.org/r/235008 (https://phabricator.wikimedia.org/T110893) (owner: 10John F. Lewis) [15:13:17] (03PS2) 10Dzahn: icinga: puppetise apache mods [puppet] - 10https://gerrit.wikimedia.org/r/235008 (https://phabricator.wikimedia.org/T110893) (owner: 10John F. Lewis) [15:13:43] (actually, it merged, and i am still here for now) [15:13:47] ok [15:13:58] Oh, that's what's happening [15:14:09] I'm cool to take over, carry on MatmaRex [15:14:24] aight. thanks, see you later today [15:14:50] (03CR) 10Dzahn: [C: 032] "yes, both are enabled on neon and per Brandon's comments on the ticket that wasn't fully puppetized" [puppet] - 10https://gerrit.wikimedia.org/r/235008 (https://phabricator.wikimedia.org/T110893) (owner: 10John F. Lewis) [15:16:51] !log thcipriani@tin Synchronized php-1.26wmf20/extensions/UploadWizard/resources/controller/uw.controller.Step.js: SWAT: Keep the uploads sorted in the order they were created in initially [[gerrit:234553]] (duration: 00m 12s) [15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:09] Checking. [15:17:18] marktraceur: thanks! [15:17:57] Looks fine to me, assuming the cache isn't still updating [15:18:10] I'll check again in like 20 minutes but it looks great, thanks thcipriani [15:18:21] marktraceur: kk, thank you. [15:18:33] aude: going to sync-dir wikidata now [15:18:37] ok [15:20:38] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail [15:21:05] !log thcipriani@tin Synchronized php-1.26wmf20/extensions/Wikidata: SWAT: Update Wikidata - Fix formatting of client edit summaries [[gerrit:234991]] (duration: 00m 21s) [15:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:10] ^ aude check please [15:21:13] * aude checks [15:21:26] looks good :) [15:21:33] thanks [15:21:45] aude: awesome—thanks! 
[15:21:57] PROBLEM - puppet last run on nembus is CRITICAL: CRITICAL: puppet fail [15:22:06] (03PS1) 10Dzahn: icinga: libssl0.9.8 for NRPE checks to run [puppet] - 10https://gerrit.wikimedia.org/r/235017 (https://phabricator.wikimedia.org/T110893) [15:22:18] thcipriani: am I next? [15:22:39] kart_: I'm going to push out those quick config changes and then do the full swat for your patch [15:23:16] (03CR) 10BBlack: [C: 031] Set up varnishkafka instance on cache servers to log raw client side events to kafka [puppet] - 10https://gerrit.wikimedia.org/r/234543 (https://phabricator.wikimedia.org/T106255) (owner: 10Ottomata) [15:23:37] Okay! [15:23:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234040 (https://phabricator.wikimedia.org/T76957) (owner: 10Deskana) [15:23:40] (03CR) 10John F. Lewis: [C: 031] icinga: libssl0.9.8 for NRPE checks to run [puppet] - 10https://gerrit.wikimedia.org/r/235017 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [15:23:46] (03PS2) 10Ottomata: Set up varnishkafka instance on cache servers to log raw client side events to kafka [puppet] - 10https://gerrit.wikimedia.org/r/234543 (https://phabricator.wikimedia.org/T106255) [15:23:49] (03PS3) 10BBlack: Fix wikitech beacon 204 [puppet] - 10https://gerrit.wikimedia.org/r/234703 (https://phabricator.wikimedia.org/T104359) (owner: 10Alex Monk) [15:24:04] (03Merged) 10jenkins-bot: Remove files from Commons from search results on wikimediafoundation.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234040 (https://phabricator.wikimedia.org/T76957) (owner: 10Deskana) [15:24:53] (03PS1) 10Dzahn: icinga: no suprise upgrades with 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/235019 [15:25:01] (03CR) 10Ottomata: [C: 032] Set up varnishkafka instance on cache servers to log raw client side events to kafka [puppet] - 10https://gerrit.wikimedia.org/r/234543 (https://phabricator.wikimedia.org/T106255) (owner: 10Ottomata) [15:25:36] 
!log starting varnishkafka instances on frontend caches to produce eventlogging client side events to kafka [15:25:41] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: Remove files from Commons from search results on wikimediafoundation.org [[gerrit:234040]] (duration: 00m 11s) [15:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:46] ^ Krenair check please [15:26:11] thcipriani, looks good, thanks! [15:26:18] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [15:26:21] Krenair: cool, thank you [15:26:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) (owner: 10MarcoAurelio) [15:26:34] (03PS2) 10Dzahn: icinga: no suprise upgrades with 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/235019 [15:27:00] (03Merged) 10jenkins-bot: Creating closed-labs.dblist and closing es.wikipedia.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) (owner: 10MarcoAurelio) [15:28:34] !log thcipriani@tin Synchronized closed-labs.dblist: SWAT: Creating closed-labs.dblist and closing es.wikipedia.beta.wmflabs.org [[gerrit:234594]] (duration: 00m 13s) [15:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:37] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [15:29:41] (03PS4) 10BBlack: Fix wikitech beacon 204 [puppet] - 10https://gerrit.wikimedia.org/r/234703 (https://phabricator.wikimedia.org/T104359) (owner: 10Alex Monk) [15:29:46] 6operations, 10Wikimedia-General-or-Unknown, 7Performance: ishmael shows blank graphs - https://phabricator.wikimedia.org/T66581#1589447 (10Dzahn) re: the last question, also see T109777 [15:29:59] 6operations, 
10Wikimedia-General-or-Unknown, 7Performance: ishmael shows blank graphs - https://phabricator.wikimedia.org/T66581#1589451 (10Dzahn) p:5Normal>3Low [15:30:05] (03CR) 10BBlack: [C: 032 V: 032] Fix wikitech beacon 204 [puppet] - 10https://gerrit.wikimedia.org/r/234703 (https://phabricator.wikimedia.org/T104359) (owner: 10Alex Monk) [15:30:27] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [15:30:55] 6operations, 7Monitoring, 5Patch-For-Review: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#1589456 (10Dzahn) re: the 'ishmael' steps, also see: T109777 we might just decom it. and it's not a thing that should be in the icinga module. it was only related because it's running on node... [15:31:03] (03PS2) 10BBlack: Fix /static 404s in beta mobile [puppet] - 10https://gerrit.wikimedia.org/r/234733 (https://phabricator.wikimedia.org/T105541) (owner: 10Alex Monk) [15:31:14] (03CR) 10BBlack: [C: 032 V: 032] Fix /static 404s in beta mobile [puppet] - 10https://gerrit.wikimedia.org/r/234733 (https://phabricator.wikimedia.org/T105541) (owner: 10Alex Monk) [15:32:26] thcipriani, it doesn't seem to have had any effect :/ [15:33:14] !log terbium - Could not find dependent Service[nscd] for File[/etc/ldap/ldap.conf] [15:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:41] what's up with the LDAP changes [15:34:05] Krenair: hmm, it did make it out with the last beta-scap-eqiad, FWIW [15:34:15] yeah, I checked the file existed on an apache [15:34:37] andrewbogott: is it known that LDAP changes influence terbium? 
[15:34:44] thcipriani: i think jzerebecki was poking at beta [15:34:57] because scap was broken there since yesterday [15:35:02] it's updating again [15:35:27] (03PS2) 10BBlack: Align mobile VCL much closer to text VCL [puppet] - 10https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) [15:35:29] (03PS2) 10BBlack: add various text backend defs to mobile [puppet] - 10https://gerrit.wikimedia.org/r/234289 (https://phabricator.wikimedia.org/T109286) [15:35:50] aude: yeah, looks like it was a keyholder that needed armed, that looks fixed now [15:35:56] yea keyholder was not armed for some reason, again. already happended yesterday [15:36:17] thcipriani, on deployment-bastion: [15:36:22] > var_dump( MWWikiversions::readDbListFile( getRealmSpecificFilename( "$IP/../closed.dblist" ) ) ); [15:36:22] array(1) { [15:36:22] [0]=> [15:36:22] string(6) "eswiki" [15:36:23] } [15:37:02] yup, so looks like it made it [15:38:33] (03PS3) 10Rush: elasticsearch: ferm for 4-7 [puppet] - 10https://gerrit.wikimedia.org/r/234671 [15:38:40] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 4-7 [puppet] - 10https://gerrit.wikimedia.org/r/234671 (owner: 10Rush) [15:38:52] (03PS4) 10Rush: elasticsearch: ferm for 4-7 [puppet] - 10https://gerrit.wikimedia.org/r/234671 [15:39:15] !log thcipriani@tin Started scap: SWAT: Ask the user to log in if the session is lost [[gerrit:234228]] [15:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:25] ^ kart_ that's for your update [15:39:37] obviously from the message, I guess :) [15:39:42] jzerebecki: i think ori was working on the keyholder and that would be why [15:40:23] (03PS1) 10Muehlenhoff: Enable ferm on initial appservers [puppet] - 10https://gerrit.wikimedia.org/r/235025 (https://phabricator.wikimedia.org/T104968) [15:41:04] thcipriani: marktraceur: i'm back. thanks for taking care of it. 
i guess i could have gone last instead ;) [15:41:11] (03CR) 10jenkins-bot: [V: 04-1] Enable ferm on initial appservers [puppet] - 10https://gerrit.wikimedia.org/r/235025 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [15:41:13] (03PS3) 10BBlack: Align mobile VCL much closer to text VCL [puppet] - 10https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) [15:41:15] (03PS3) 10BBlack: add various text backend defs to mobile [puppet] - 10https://gerrit.wikimedia.org/r/234289 (https://phabricator.wikimedia.org/T109286) [15:41:57] thcipriani: :) [15:42:18] (03CR) 10Rush: [C: 032] elasticsearch: ferm for 4-7 [puppet] - 10https://gerrit.wikimedia.org/r/234671 (owner: 10Rush) [15:45:36] (03PS2) 10Muehlenhoff: Enable ferm on initial appservers [puppet] - 10https://gerrit.wikimedia.org/r/235025 (https://phabricator.wikimedia.org/T104968) [15:46:27] (03PS4) 10BBlack: add various text backend defs to mobile [puppet] - 10https://gerrit.wikimedia.org/r/234289 (https://phabricator.wikimedia.org/T109286) [15:46:43] (03CR) 10BBlack: [C: 032 V: 032] add various text backend defs to mobile [puppet] - 10https://gerrit.wikimedia.org/r/234289 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [15:48:04] mutante, thcipriani, ori: yes, that is it. thx. someone changed it twice in puppet within a few days, which caused it to restart and thus be disarmed, but the person deploying the change didn't arm the keyholder on beta again. each time. please do so in the future. [15:48:34] * jzerebecki mumbles something about monitoring [15:48:41] jzerebecki: there is icinga for it [15:49:31] mutante: for production yes. and for beta in shinken? and does someone look at those? [15:49:55] jzerebecki: well, yea, good point. 
i dont know [15:51:18] PROBLEM - Apache HTTP on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.119 second response time [15:51:37] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.127 second response time [15:52:01] was this the same than the one that failed before? mw2187 [15:52:02] jzerebecki: point taken, i forgot about beta [15:54:58] apparently it is not (looking at the logs) [15:57:41] (03CR) 10Jdlrobson: Enable WikidataPageBanner extension on Russian Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234942 (https://phabricator.wikimedia.org/T110837) (owner: 10Glaisher) [15:59:04] !log restarting hhvm on mw2187 [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:46] kart_: scap still syncing apaches, FYI [16:00:51] 6operations: figure out what's wrong with confd + varnish currently - https://phabricator.wikimedia.org/T110899#1589517 (10BBlack) 3NEW [16:01:25] ah. [16:01:31] restarting apache and hhvm doesn't fix the issue [16:02:55] that one needs rebuildLocalisationCache.php [16:02:59] 6operations, 10Traffic: figure out what's wrong with confd + varnish currently - https://phabricator.wikimedia.org/T110899#1589524 (10BBlack) [16:03:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:04:18] thcipriani, okay, so on deployment-bastion I changed CommonSettings to always set up $wikiTags, and ran sync-common [16:04:35] var_dumped wikiTags, and closed was in there [16:04:44] something above a known cause for 503s to start rising? 
[16:04:56] but var_dump( $groupOverrides ) does not show the settings for closed wikis [16:05:14] little blue line taking off on the right, doesn't look like the usual brief spike: https://gdash.wikimedia.org/dashboards/reqerror/ [16:05:50] bblack, mine is creating exceptions, but only on codfw [16:06:23] !log thcipriani@tin Finished scap: SWAT: Ask the user to log in if the session is lost [[gerrit:234228]] (duration: 27m 07s) [16:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:32] yeah, exceptions from codfw hosts running enwiki, about the localisation cache [16:06:51] according to fluorine:/a/mw-log/exception.log [16:07:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [16:07:26] only mw2187, actually [16:07:34] yes, [16:07:46] lots of api queries on 5xx [16:08:05] kart_: your cx scap finished. Check please. [16:08:40] (03CR) 10Filippo Giunchedi: [C: 031] icinga: no suprise upgrades with 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/235019 (owner: 10Dzahn) [16:08:41] okay! [16:08:58] (03CR) 10John F. Lewis: [C: 031] icinga: no suprise upgrades with 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/235019 (owner: 10Dzahn) [16:09:03] it could be https://tools.wmflabs.org/para/Commons:Special:NewFiles [16:09:08] which is broken [16:09:31] (not mediawiki's fault) [16:09:48] (03PS1) 10Andrew Bogott: Revert "Add more nslcd notifying" [puppet] - 10https://gerrit.wikimedia.org/r/235031 [16:10:11] bd808: PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [16:10:12] \o/ [16:10:24] thcipriani: Tested. Working as expected. [16:10:27] i mean, /o\ about the fact that they're spiking, but \o/ about getting an alert [16:10:39] kart_: awesome. thanks! 
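Note on the graphite1001 alerts ("100.00% of data above the critical threshold [5000.0]") and the earlier check_graphite message "More than half of the datapoints are undefined": a rough sketch of how a percentage-over-threshold check of this kind could be evaluated. This is not the actual check_graphite code; the function name, the `percent` parameter and the sample values are assumptions chosen for illustration.

```python
def classify(datapoints, warning, critical, percent=1.0):
    """Classify a series the way a percentage-over-threshold check might.

    datapoints: list of numbers, with None for undefined samples.
    percent: share of defined samples (in %) that must exceed a threshold
             before that level is reported (assumed default).
    """
    undefined = sum(1 for value in datapoints if value is None)
    if not datapoints or undefined * 2 > len(datapoints):
        # More than half of the samples are missing: no verdict possible.
        return "UNKNOWN"
    defined = [value for value in datapoints if value is not None]

    def share_above(threshold):
        above = sum(1 for value in defined if value > threshold)
        return 100.0 * above / len(defined)

    if share_above(critical) >= percent:
        return "CRITICAL"
    if share_above(warning) >= percent:
        return "WARNING"
    return "OK"


# One sample out of 14 over the critical threshold is about 7.14% of the data.
print(classify([100] * 13 + [600], warning=250, critical=500))
```

Treating a mostly undefined series as UNKNOWN rather than OK keeps a gap in the metrics from silently hiding a real problem.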
[16:10:45] (03CR) 10Andrew Bogott: [C: 032] Revert "Add more nslcd notifying" [puppet] - 10https://gerrit.wikimedia.org/r/235031 (owner: 10Andrew Bogott) [16:11:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [16:12:41] nutcracker SYSTEM ERROR on 2 hosts- there you have your specimens to debug :-) [16:13:05] mw1125 and mw1142 [16:14:03] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:15:23] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [16:15:36] Krinkle_: hiya, yt? [16:15:50] you mentioned that you have an eventlogging consumer somehwere, right? [16:17:18] (03PS3) 10Dzahn: [Italian Planet] Update Wikimedia Italia feeds [puppet] - 10https://gerrit.wikimedia.org/r/234921 (owner: 10Nemo bis) [16:17:27] jynus: are those depooled with nutcracker errors depooled already? 
[16:17:44] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:49] nope, I want to do too many things at the same time [16:17:56] but I intend to [16:18:21] (CR) Dzahn: [C: 2] [Italian Planet] Update Wikimedia Italia feeds [puppet] - https://gerrit.wikimedia.org/r/234921 (owner: Nemo bis) [16:18:43] (PS1) BBlack: conftool: disable etcd1002 [puppet] - https://gerrit.wikimedia.org/r/235034 (https://phabricator.wikimedia.org/T110899) [16:19:05] any help is welcome [16:19:54] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:20:17] yeah it is logging like nuts [2015-08-31 16:19:45.794] nc_proxy.c:330 client connections 935 exceed limit 935 [16:20:21] depooling [16:22:21] !log depool mw1125 + mw1142 from api, nutcracker client connections exceeded [16:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:43] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:22:46] thanks, godog [16:23:27] jynus: np, how did you get to those two? 
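Note on the nutcracker line quoted at 16:20:17: a small sketch of how such lines could be picked out of a log to spot hosts that have hit their client connection limit. The regular expression is derived only from the one sample line in this log, so the exact format is an assumption, and `parse_limit_line` is a hypothetical helper name.

```python
import re

# Pattern based solely on the single sample line quoted in the channel.
LIMIT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] nc_proxy\.c:\d+ "
    r"client connections (?P<count>\d+) exceed limit (?P<limit>\d+)"
)


def parse_limit_line(line):
    """Return timestamp, connection count and limit, or None if the line does not match."""
    match = LIMIT_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": match.group("ts"),
        "connections": int(match.group("count")),
        "limit": int(match.group("limit")),
    }


sample = "[2015-08-31 16:19:45.794] nc_proxy.c:330 client connections 935 exceed limit 935"
print(parse_limit_line(sample))
```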
[16:23:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] [16:23:52] godog, kibana [16:24:00] let me give you the link [16:24:19] arg, too many tabs [16:24:50] https://logstash.wikimedia.org/#dashboard/temp/AU-Ekw8-k7jM4RPm6ljL <-- can confirm errors gone [16:26:14] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:26:32] my issue with memcached connection handling is that when it fails, the database usually sees a spike in failed connections, not because it cannot handle the load (it can, for a single host), but because it cannot handle 5000 connections starting at the same time [16:27:14] i have pooling at server side but I think the issue is TCP at client side [16:27:28] (only a suspicion, I have yet to prove it) [16:27:45] thanks jynus! [16:27:58] 500 back to normal, too [16:29:43] someone wanted to profile those, cannot remember who; let me search for the ticket [16:31:00] uh, mw1142 again? that smells fishy [16:31:09] https://phabricator.wikimedia.org/T105131 [16:32:03] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [16:33:03] RECOVERY - puppet last run on neptunium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:33:07] operations, Labs, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1589617 (jcrespo) mw1125 + mw1142 were depooled by @fgiunchedi just some minutes ago with the same kind of error: ``` Memcached error for key "enwiki:messages:en:status" on server "/... 
[16:33:19] jynus: kk, I'll update the ticket [16:33:28] I already did :) [16:37:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:37:54] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:46:12] operations, Labs, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1589655 (fgiunchedi) on the nutcracker side, the logs were being spammed by errors: (mw1142) ``` [2015-08-31 15:02:00.428] nc_response.c:159 filter stray rsp 1553464878 len 41 on s 87... [16:46:33] yup, added some more info, I'm not sure it is related to the original failures though [16:47:40] sure, but same host, same kind of errors (it is up, but all connections failing)... there is a chance [16:50:24] yup, not sure who else was interested in the failures, ori perhaps? [16:58:00] operations, Analytics, Discovery, MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1589674 (GWicke) Etherpad notes: https://etherpad.wikimedia.org/p/scalable_events_system [17:00:26] ori: w00t! The alert actually worked. That makes me very happy [17:00:41] ^there we have it [17:15:18] operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1589740 (RobH) Well, I'm told by HR that Joel is still working on this & @JKrauska stated earlier in this thread he was working on it. He has stated that he wants Ops help, and stated h... [17:18:57] operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1589751 (JKrauska) Removing myself from this toxic phab ticket, it's not doing any help. [17:20:51] it seems wikitech search is broken. logs say: Couldn't connect to host, Elasticsearch down?
[17:22:14] (CR) Alexandros Kosiaris: [C: 1] "Yup. I am the one that removed that box. Forgot to remove it from conftool." [puppet] - https://gerrit.wikimedia.org/r/235034 (https://phabricator.wikimedia.org/T110899) (owner: BBlack) [17:26:57] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1589778 (RobH) So I thought of what I think is a better solution. #procurement is the project for vendor quotes, S4 is the space. #procurement is already a closed membership, and I add in users o... [17:29:51] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1589789 (mmodell) from the process list that @jcrespo posted, it looks to me like several processes are running somewhat slow queries, with each process holding... [17:30:16] operations, CirrusSearch, Discovery, hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1589790 (RobH) ETA is today. [17:31:45] (CR) BBlack: [C: 2] conftool: disable etcd1002 [puppet] - https://gerrit.wikimedia.org/r/235034 (https://phabricator.wikimedia.org/T110899) (owner: BBlack) [17:34:14] operations, Multimedia, Wikimedia-General-or-Unknown, Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849#1589813 (matmarex) I went through the bugs filed under #mediawiki-extensions-pdfhandler and found two more (seems like distinct issues) that I can't repr... [17:38:09] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1589831 (jcrespo) > this query appears multiple times I, on purpose, did not include the full query, as this is a public ticket. I can do that on request (for...
[17:39:18] Ops-Access-Requests, operations: Requesting research DB access for Alex Monk - https://phabricator.wikimedia.org/T110754#1589833 (RobH) a:RobH So neither one of those groups are sudo, this only needs the three day wait. Since I'm on clinic duty this week, I'm claiming to for implementation after the 3 d... [17:40:46] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1589839 (mmodell) @jcrespo: Can you email me the full query or create a private task to post it? I'd like to try to figure out where that query is coming from a... [17:43:26] (PS1) Faidon Liambotis: openstack: fix non-capitalized resource references [puppet] - https://gerrit.wikimedia.org/r/235046 [17:44:01] (CR) Faidon Liambotis: [C: 2] openstack: fix non-capitalized resource references [puppet] - https://gerrit.wikimedia.org/r/235046 (owner: Faidon Liambotis) [17:44:58] bblack: that etcd1002 change was not merged [17:45:02] bblack: should I merge it? [17:45:11] oh sorry, yes [17:45:17] done [17:46:08] (PS1) John F. Lewis: admin: add Krenair to researchers [puppet] - https://gerrit.wikimedia.org/r/235047 (https://phabricator.wikimedia.org/T110754) [17:46:10] (CR) Mattflaschen: [C: -1] "Don't merge right away." [mediawiki-config] - https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen) [17:46:20] (PS2) John F. Lewis: admin: add Krenair to researchers [puppet] - https://gerrit.wikimedia.org/r/235047 (https://phabricator.wikimedia.org/T110754) [17:46:44] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:49:12] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1589891 (faidon) I've probably missed some discussion -- can someone explain why we need a separate "space" for this?
[17:53:25] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1589900 (brion) [17:53:34] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: Puppet last ran 2 days ago [17:54:51] (PS1) JanZerebecki: Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 [17:55:34] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:56:28] (CR) JanZerebecki: "I'm just guessing about the syntax. Please verify." [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [17:57:25] mark: can i please get some input on https://phabricator.wikimedia.org/T106447 ? [17:57:47] matanya: team meeting starting in 3 minutes, just fyi :) [17:57:51] so, not right now [17:58:08] replace the whole agenda with that one task link ;) [17:58:14] greg-g: i know, thanks. a place holder for mark to reply in his spare time :) [17:58:24] :) :) [18:22:08] Krenair: anything wrong in the file I committed (re eswiki-betalabs)? [18:40:40] (PS1) EBernhardson: Configure titlesuggest index for dewiki and enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/235055 (https://phabricator.wikimedia.org/T110922) [18:41:53] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1590207 (bd808) [18:44:36] (CR) Muehlenhoff: [C: -1] "If multiple hosts are configured in an srange they need to be wrapped in double brackets "(("" [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [18:46:19] (CR) Yuvipanda: [C: -1] "I think we should modify the package / init script than do this."
[puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: Hashar) [18:48:30] (CR) Yuvipanda: [C: 1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/234669 (owner: Hashar) [18:50:39] (CR) Rush: "Where do we use 1.6? AFAIK all things in prod and even beta are 1.7.1." [puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: Hashar) [18:53:05] (CR) Yuvipanda: "While I don't have a problem with the change itself, as a matter of principle I think we shouldn't have sudo changes go through PuppetSWAT" [puppet] - https://gerrit.wikimedia.org/r/234539 (owner: Hashar) [18:53:50] (CR) Yuvipanda: [C: 1] "(OK for puppetswat since this is a no-op)" [puppet] - https://gerrit.wikimedia.org/r/234670 (owner: Hashar) [18:54:25] (PS2) JanZerebecki: Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 [18:55:49] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:58:28] operations, Analytics, Discovery, MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1590270 (GWicke) [19:02:18] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:02:34] ^manually run finished fine so idk [19:05:01] (PS3) BBlack: icinga: no suprise upgrades with 'latest' [puppet] - https://gerrit.wikimedia.org/r/235019 (owner: Dzahn) [19:05:10] (CR) BBlack: [C: 2 V: 2] icinga: no suprise upgrades with 'latest' [puppet] - https://gerrit.wikimedia.org/r/235019 (owner: Dzahn) [19:05:17] (PS2) BBlack: icinga: libssl0.9.8 for NRPE checks to run [puppet] - https://gerrit.wikimedia.org/r/235017 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn) [19:05:28] (CR) BBlack: [C: 2 V: 2] icinga: libssl0.9.8 for NRPE checks to run [puppet]
- https://gerrit.wikimedia.org/r/235017 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn) [19:08:03] (CR) Aklapper: "> > Don't know if DB access has been granted for sql_name either." [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183) (owner: Aklapper) [19:13:07] (CR) Alex Monk: "Related to T110635?" [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [19:16:23] operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1590336 (Krenair) Open>Resolved a:Krenair [19:17:32] operations, Traffic, Patch-For-Review: figure out what's wrong with confd + varnish currently - https://phabricator.wikimedia.org/T110899#1590341 (BBlack) Open>Resolved a:BBlack (will open sep task about monitoring hole) [19:17:39] (PS1) Ori.livneh: pybal (1.08) trusty; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 [19:18:29] operations: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933#1590345 (BBlack) NEW [19:18:46] (CR) JanZerebecki: "If the jobs are also being run on silver, yes." [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [19:19:46] (CR) BBlack: [C: -1] "We don't have trusty anymore, just precise and jessie. Just release this for jessie-wikimedia and we'll roll it out as they upgrade to je" [debs/pybal] - https://gerrit.wikimedia.org/r/235062 (owner: Ori.livneh) [19:20:12] (CR) Ori.livneh: "@bblack: yep, got it." [debs/pybal] - https://gerrit.wikimedia.org/r/235062 (owner: Ori.livneh) [19:22:28] operations, Services, Icinga, Monitoring: create service/user groups in icinga - https://phabricator.wikimedia.org/T107884#1590362 (JohnLewis) I've looked at this and the implementation seems fairly straight forward (though depends on personal definitions).
A contact can manage their services if the... [19:22:37] (PS3) BBlack: Added maps-cluster referer rules (e.g. Phab) [puppet] - https://gerrit.wikimedia.org/r/234600 (owner: Yurik) [19:22:54] (PS2) Ori.livneh: pybal (1.08) jessie-wikimedia; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 [19:25:02] operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1590377 (srodlund) Hi All, Voice out of the blue on this public ticket. Megan N. and I walked through some scenarios for using Phabricator as a system to help with on-boarding and off-b... [19:26:19] (PS4) BBlack: Added maps-cluster referer rules (e.g. Phab) [puppet] - https://gerrit.wikimedia.org/r/234600 (owner: Yurik) [19:26:48] (CR) BBlack: [C: 1] pybal (1.08) jessie-wikimedia; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 (owner: Ori.livneh) [19:27:21] (CR) BBlack: [C: 2] Added maps-cluster referer rules (e.g. Phab) [puppet] - https://gerrit.wikimedia.org/r/234600 (owner: Yurik) [19:34:25] (PS4) BBlack: Align mobile VCL much closer to text VCL [puppet] - https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) [19:42:19] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [19:48:15] operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1590441 (Aklapper) [OT] For general information regarding private tickets in Phabricator, see the [[ https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects#Restricting... [19:48:16] (PS1) John F. Lewis: Move host contact_groups to hiera and migrate existing [puppet] - https://gerrit.wikimedia.org/r/235065 [19:48:37] (PS2) John F.
Lewis: Move host contact_groups to hiera and migrate existing [puppet] - https://gerrit.wikimedia.org/r/235065 [19:55:52] (CR) Alex Monk: "Silver runs its own jobs, yes. Please add the task to the commit message." [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [19:58:28] operations, Labs, Labs-sprint-112: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1590473 (yuvipanda) [19:58:49] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [19:58:54] (CR) RobH: [C: 1] admin: remove dupe 'haithams' from statistics-users [puppet] - https://gerrit.wikimedia.org/r/234669 (owner: Hashar) [20:00:01] (CR) RobH: [C: 1] admin: add 'demon' to gerrit-admins group [puppet] - https://gerrit.wikimedia.org/r/234670 (owner: Hashar) [20:00:05] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150831T2000). [20:03:35] (CR) RobH: [C: 1] "allowing them to run puppet seems sane, in particular since it will automatically run on its own within 30m."
[puppet] - https://gerrit.wikimedia.org/r/234539 (owner: Hashar) [20:05:28] (PS3) Deskana: Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 (owner: JanZerebecki) [20:07:34] starting parsoid deploy [20:08:30] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:28] !log deployed parsoid version c3e4df5e [20:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:49] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:26:01] operations, Analytics, Discovery, MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (Ottomata) [20:26:23] operations, Analytics, Discovery, EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (Ottomata) [20:29:22] (CR) RobH: [C: 1] elasticsearch: ensure /var/run subdir exists [puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: Hashar) [20:31:02] (PS4) JanZerebecki: Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 (https://phabricator.wikimedia.org/T110635) [20:31:31] YuviPanda: i reviewed a bunch of them [20:31:44] for what it's worth, thanks for pushing this idea forward (puppet swat), it's awesome. [20:31:47] =] [20:34:53] robh: thanks :D there seems to be broad support, maybe in a month or so we can start having an actual rotation :) [20:35:55] (CR) RobH: "So I put +1, but even the commit message says this isn't the proper fix."
[puppet] - https://gerrit.wikimedia.org/r/233413 (https://phabricator.wikimedia.org/T109497) (owner: Hashar) [20:36:19] I agree it shouldn't default to who is on clinic duty [20:36:35] since clinic duty will typically be handling tasks no one has snagged, where these have the majority of the work done [20:36:47] but, this week i wanna do it, and i happen to be on clinic [20:37:01] not sure if that part was clear when i signed up =] [20:38:11] (CR) Dzahn: [C: 1] "+1. Moritz asked me to run it in compiler. but that fails due to unrelated compiler issue. Error: Failed to compile catalog for node silve" [puppet] - https://gerrit.wikimedia.org/r/235048 (https://phabricator.wikimedia.org/T110635) (owner: JanZerebecki) [20:40:47] (CR) RobH: [C: -1] "technically this may be a sudo level request and require ops meeting approval. After some IRC discussion, I'm changing my vote to reject " [puppet] - https://gerrit.wikimedia.org/r/234539 (owner: Hashar) [20:44:38] !log ferm for elastic100[4-7] and adjust ferm to include wikitech source [20:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:03] robh: yeah clinic vs swat is something we need to talk about in a week or to [20:46:04] Two [20:46:13] I want _Joe_ back for that tho [20:46:51] indeed [20:49:43] operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1590596 (RobH) I've emailed back into the WordPress support thread and CC'd @slaporte with the update. The email content is below: Mark/Simon/Support, Ok guys, I've attached the policy.wikimed...
[20:52:26] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1590605 (hashar) NEW [20:52:37] (PS2) Hashar: admin: let contint-admins run puppet [puppet] - https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) [20:52:59] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1590616 (RobH) So Arul confirmed they don't provide PDUs, but cannot tell me why they are then installed @ eqdfw. As such, I've chatted with Mark/Chris/Papaul during the ops meeting and we'll be including t... [20:53:07] (CR) Hashar: "I removed this patch from tomorrow's PuppetSWAT and filed ops-access-request task T110943 :-)" [puppet] - https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) (owner: Hashar) [20:53:29] (PS3) Ori.livneh: pybal (1.08) jessie-wikimedia; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 [20:57:16] (CR) Ori.livneh: [C: 2] pybal (1.08) jessie-wikimedia; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 (owner: Ori.livneh) [20:58:11] !log imported pybal_1.08_amd64.changes to jessie-wikimedia [20:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:46] (PS1) RobH: appending in ryan lanes access request task [puppet] - https://gerrit.wikimedia.org/r/235130 [21:00:10] (CR) RobH: [C: 2] appending in ryan lanes access request task [puppet] - https://gerrit.wikimedia.org/r/235130 (owner: RobH) [21:00:26] operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1590629 (RobH) [21:00:27] operations: determine ryan lane's access rights - https://phabricator.wikimedia.org/T109521#1590628 (RobH) Open>Resolved [21:01:23]
operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1590634 (Dzahn) >>! In T108131#1590377, @srodlund wrote: > Because issues of on-boarding and off-boarding can be somewhat sensitive, I wouldn't recommend using Phabricator as tool for ma... [21:01:30] (Merged) jenkins-bot: pybal (1.08) jessie-wikimedia; urgency=low [debs/pybal] - https://gerrit.wikimedia.org/r/235062 (owner: Ori.livneh) [21:01:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1420 bytes in 0.174 second response time [21:05:01] (PS1) RobH: updating nik everett's access [puppet] - https://gerrit.wikimedia.org/r/235131 [21:06:18] (PS2) RobH: updating nik everett's access [puppet] - https://gerrit.wikimedia.org/r/235131 [21:06:56] (CR) RobH: [C: 2 V: 2] updating nik everett's access [puppet] - https://gerrit.wikimedia.org/r/235131 (owner: RobH) [21:07:29] operations: determine nik everett's shell/production access levels - https://phabricator.wikimedia.org/T109390#1590654 (RobH) @EBernhardson: Thanks for the feedback, it was quite useful! I've gone ahead and removed him from all the groups listed (wikidatea-query-roots, statistics-privatedata-users, and udp2l... [21:07:57] so who would know about the rc bot :) [21:08:35] as in "i know how to make it come back even after server reboots" [21:08:46] i mean the bot on irc.wikimedia.org [21:11:34] Which of the many bots? The source of all the info? :) [21:12:49] Fun, https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org only says not to touch it ;) [21:13:26] mutante, the source is in a folder in svn... [21:13:42] it's a long time I don't look at it [21:14:01] I think it was python [21:14:48] who knows on which git repo it got moved [21:14:50] .. in svn ? uhm..
ok, so hopefully imported [21:15:06] hehe [21:15:06] Nemo_bis: yes, that's why i ask this way, kind of :) [21:15:11] "it's a long time I don't look at it" [21:16:05] mutante, logging into the server where it is supposed to run [21:16:21] hopefully it won't be too hard to see what was supposed to be running [21:16:23] Platonides: you got shell there? [21:17:09] sorry [21:17:13] udpmixircecho.py ? [21:17:16] "if you log into the server..." [21:17:27] I don't have any shell on production [21:17:40] as should be obvious from that phab ticket [21:17:45] yea, ok, i just remember we rebooted this one before [21:17:45] mutante, looks like it [21:17:49] and the bot didn't come back [21:18:00] that's what i want to avoid this time [21:19:21] oh, an init script is there even [21:19:28] maybe that got added then [21:19:36] /etc/init/ircecho.conf:exec /usr/local/bin/udpmxircecho.py rc-pmtpa localhost [21:19:41] /etc/init/udpmxircecho:exec /usr/local/bin/udpmxircecho.py rc-eqiad argon.wikimedia.org [21:19:45] gwicke / subbu: which of you two (maybe more!) are best to be questioned about parsoid? (running out of the WMF, different project) [21:19:47] pretty sure it was still the "pmtpa" one though [21:20:10] yea, only that is running, not that eqiad stuff [21:20:13] mutante: yes, because eqiad is breaking apparently :) [21:20:50] !log installing package upgrades on argon [21:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:20] JohnFLewis, either of us can answer .. go ahead. [21:22:17] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1590687 (RobH) @faidon: spaces are the best way to ensure complete privacy of the task. It is outlined in the task above (plus a lot of out of band discussion). My understanding is the best way t...
[21:22:21] subbu: okay - this may be a discussion or not but firstly; is the apt repo listed on MediaWiki (parsoid.wmflabs.org:8080/deb/) maintained actively or are docs out of date? (asking since I see quite a few parsoid deploys and code changes but no deb changes) [21:22:40] JohnFLewis, also, #mediawiki-parsoid is a good channel to ask in since there are others who can answer as well. [21:23:11] we'll see depending on the answer about :) [21:23:15] JohnFLewis, gwicke maintains those debs .. and we update it occasionally .. we tried updating it on Friday, but ran into some issues with missing gpg keys [21:23:24] * subbu looks for the bug report [21:23:51] ah so it is then. thought I was missing something blatant. thanks for looking up the bug report [21:24:37] JohnFLewis, https://phabricator.wikimedia.org/T110698 [21:25:38] subbu: thanks. YuviPanda ^^ a ticket for you related to labs NFS missing files [21:25:46] (PS5) BBlack: Align mobile VCL much closer to text VCL [puppet] - https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) [21:29:16] (CR) BBlack: [C: 2] Align mobile VCL much closer to text VCL [puppet] - https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) (owner: BBlack) [21:35:55] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590739 (Jalexander) NEW [21:37:49] operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1590753 (Jalexander) NEW [21:37:58] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590760 (JohnLewis) a:Dzahn [21:43:00] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590778 (JohnLewis) Probably coordinate this with the migration as we're planning on changing it anyway.
[21:43:53] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590779 (Dzahn) Indeed, we are going to change it anyways as part of the migration and that should happen within the same timeframe. [21:45:59] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590789 (Jalexander) >>! In T110949#1590779, @Dzahn wrote: > Indeed, we are going to change it anyways as part of the migration and that should happen within the same timeframe. Works for me :)... [21:48:26] JohnFLewis: added it to this sprint! [21:48:43] YuviPanda: okay [21:48:47] Thanks [21:54:59] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590802 (Dzahn) p:Triage>Normal [21:57:22] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590808 (JohnLewis) @jalexander also to make things easier for Daniel, to clear up uncertainty and document something relating to mailman correctly (T109534 as an irrelevant quip), who outside o... [21:57:29] operations, Wikimedia-Mailing-lists: send follow-up email, announce changes with new mailman version if any that have user impact - https://phabricator.wikimedia.org/T110140#1590810 (Dzahn) we need a summary of the above or we can just link over here in that email [21:59:54] operations, Wikimedia-Mailing-lists: send follow-up email, announce changes with new mailman version if any that have user impact - https://phabricator.wikimedia.org/T110140#1590812 (JohnLewis) A summary? Sure. Will sum up and add any breaking / fun stuff separated for people. [22:06:22] operations, Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1590824 (Dzahn) @sodium:/var/lib/mailman/restore/var/lib/mailman/data# `find .
| sed -e "s/pck//g" -e "s/[0-9]//g" | sort | uniq -c | sort -nr | sed -e "s/\.\/heldms... [22:07:15] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1590825 (RobH) That being stated, perhaps @Aklapper or @chasemp can give a better reasoning, as they both recommended spaces and understand the reasoning behind it. [22:07:59] operations, Traffic: Upgrade Pybal to 1.08 - https://phabricator.wikimedia.org/T110954#1590827 (chasemp) NEW [22:08:26] operations, Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1590834 (Dzahn) p:Triage>Normal [22:08:30] operations, Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1590838 (JohnLewis) [22:08:31] operations, Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1590836 (JohnLewis) Open>Resolved Resolved in my opinion. Now let's look at these lists and figure why they have tens of thousands of emails in moderation. [22:08:48] operations, Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1590839 (Dzahn) [22:10:52] operations, Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1590842 (Dzahn) @sodium:/var/lib/mailman/restore/var/lib/mailman/data# `find . | sed -e "s/pck//g" -e "s/[0-9]//g" | sort | uniq -c | sort -nr | sed -e "s/\.\/heldmsg-//g" -e "s/... [22:12:13] operations, Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1590843 (Dzahn) lists identified. now what, is the ticket resolved too and was a duplicate? or did we want more here, like a cron or something. this was just historic data from s...
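The truncated `find . | sed ... | sort | uniq -c | sort -nr` pipeline quoted at 22:06:22 and 22:10:52 appears to count held-message files per list. Below is a rough Python sketch of the same idea, not a reconstruction of the elided part of the command. It assumes the files are named `heldmsg-<listname>-<id>.pck` (inferred only from the `heldmsg-` and `pck` fragments visible in the log), and `count_held_messages` is a hypothetical name.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed filename shape: heldmsg-<listname>-<numeric id>.pck
# (inferred from the fragments visible in the quoted pipeline).
HELD_RE = re.compile(r"^heldmsg-(?P<list>.+)-(?P<id>\d+)\.pck$")


def count_held_messages(data_dir):
    """Return (listname, count) pairs, largest moderation queue first."""
    counts = Counter()
    # rglob mirrors `find .`, which also descends into subdirectories.
    for path in Path(data_dir).rglob("heldmsg-*"):
        match = HELD_RE.match(path.name)
        if match:
            counts[match.group("list")] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Run from inside the Mailman data directory, as the pipeline was.
    for list_name, held in count_held_messages("."):
        print(f"{held:7d} {list_name}")
```

Unlike the `sed -e "s/[0-9]//g"` step, which strips every digit, the regular expression only removes the trailing numeric id, so list names that themselves contain digits would stay intact.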
[22:12:43] operations, Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1590844 (Dzahn) should still check these lists though to stop them from growing so much again. maybe have to find new admins/moderators [22:14:34] operations, Wikimedia-Mailing-lists, Patch-For-Review: right before the switch: lower TTL to 10 seconds - https://phabricator.wikimedia.org/T110135#1590855 (Dzahn) We agreed this is not really needed and 5 minutes is enough. [22:14:50] operations, Wikimedia-Mailing-lists, Patch-For-Review: right before the switch: lower TTL to 10 seconds - https://phabricator.wikimedia.org/T110135#1590856 (Dzahn) Open>declined [22:14:51] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1590857 (Dzahn) [22:14:58] operations, Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1590858 (JohnLewis) this task was meant generally for looking at that list and creating tickets for action with lists if there is a characteristic to look at (e.g. a list has no m... [22:15:12] (Abandoned) Dzahn: lists: lower TTL to 10 seconds [dns] - https://gerrit.wikimedia.org/r/233637 (https://phabricator.wikimedia.org/T110135) (owner: Dzahn) [22:15:34] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590860 (Jalexander) >>! In T110949#1590808, @JohnLewis wrote: > @jalexander also to make things easier for Daniel, to clear up uncertainty and document something relating to mailman correctly (...
[22:16:10] operations, Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1590862 (JohnLewis) [22:16:46] operations, Wikimedia-Mailing-lists, Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1590865 (Dzahn) We should just blindly rsync it all so it's exactly like before and if in doubt we can deal with it any time on the new server. That mean... [22:17:18] operations, Traffic: Upgrade Pybal to 1.08 - https://phabricator.wikimedia.org/T110954#1590866 (chasemp) [22:19:00] operations, Wikimedia-Mailing-lists: wikinews-l: no active listadmin - https://phabricator.wikimedia.org/T110956#1590868 (JohnLewis) NEW a:Dzahn [22:19:19] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1590875 (Krenair) A space ensures that even if you accidentally (or deliberately) open an object's policy to something that shouldn't actually be able to view the contents, it's still restricted to... [22:21:37] operations, Wikimedia-Mailing-lists: wikinews-l: no active listadmin - https://phabricator.wikimedia.org/T110956#1590876 (Dzahn) 15:21 -!- Irssi: #wikinews: Total of 36 nicks [1 ops, 0 halfops, 0 voices, 35 normal] 15:21 < mutante> hey, wikinews people 15:21 < mutante> we need an admin for your mailing li... [22:22:05] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1590877 (JohnLewis) NEW a:Dzahn [22:25:44] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1590892 (Dzahn) 15:26 -!- Channel #wikimedia-ru created Wed Dec 30 05:59:07 2009 15:26 -!- Irssi: Join to #wikimedia-ru was synced in 1 secs 15:26 < mutante> hey putnik 15:27 < mutante> is wikiru-l used at al...
[22:26:31] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1590896 (Dzahn)
[22:29:35] How do we subscribe to ops@lists.wikimedia.org?
[22:29:39] AndyRussG: ^
[22:30:07] dunnow awight, I was just gonna ask you the same question!
[22:30:10] (jk)
[22:30:34] hehe, echo echo!
[22:30:44] awight: ask a listadmin. mutante / robh :) rob might be best as he is on clinic
[22:30:58] heh, i can add you both
[22:30:59] JohnFLewis: wonderful, thanks for the tip!
[22:31:30] robh: pls also add eeggleston, cdentinger, and dkozlowski
[22:31:31] robh: thanks much!
[22:32:53] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590916 (Krenair) Can I suggest that it may be a good idea to keep a central always-up-to-date list of people who should have the master/list creation passwords? And in an ideal world, all list...
[22:33:15] awight / AndyRussG done and done
[22:33:19] plus all those other folks
[22:33:19] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1590917 (MaxSem) Wikimedia Russia uses different means of communication, I vote for closure.
[22:33:34] robh: fantastic :D
[22:34:25] robh: wicked, much appreciated!
[22:34:51] You should be able to request subscription in the same way as all other lists
[22:34:54] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590923 (RobH) @Krenair: You can, and I agree with your proposal. Unfortunately, the mailman passwords have a horrible habit of being distributed beyond that list. List admin passwords are set...
[22:35:34] i hate shared passwords.
[22:36:15] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590924 (Dzahn) I would like to see such a list as well.
That would remove all ambiguity who should have it when we need to change it. (for master and list creator). All admin passwords should n...
[22:36:43] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590926 (RobH) Also I suggest these lists (of who has them) be kept someplace public. If its simply a public listing, not any kind of software maintained list, perhaps wikitech on the mailing l...
[22:36:49] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590927 (Krenair) >>! In T110949#1590923, @RobH wrote: > @Krenair: You can, and I agree with your proposal. Unfortunately, the mailman passwords have a horrible habit of being distributed beyon...
[22:37:32] i get the impression that if you asked krenair and i to comment on a dozen different tasks on how to resolve them
[22:37:39] we'd end up putting 95% of the same answers =]
[22:37:42] :)
[22:38:03] PROBLEM - puppet last run on elastic1005 is CRITICAL: CRITICAL: Puppet last ran 3 days ago
[22:38:18] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590929 (JohnLewis) So; on record we have the following: Site password: * Ops * James * Philippe List creator password: * Me
[22:38:21] I know my suggestion about list admin passwords was a bit hopeless
[22:38:33] but I felt I needed to make the point that shared passwords like these are terrible
[22:38:33] doesnt mean someone shouldnt make the suggestion on a nice public task.
[22:38:36] indeed
[22:39:03] Pretty sure I have the daily article list password somewhere...
[22:39:46] operations, Wikimedia-Mailing-lists: Change Mailman master password - https://phabricator.wikimedia.org/T110949#1590933 (Dzahn) I don't think it's really that related to NDA or not. It's about having a definitive list of people who ought to have it because they need it.
We should try and avoid using it fo...
[22:40:00] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:42:34] Krenair: maybe you do, maybe you don't - passwords change ;)
[22:42:50] but yeah, shared passwords by default is bad and is one thing solved in Mailman 3! (more hype for Rob :p)
[22:43:15] Is Rob migrating us to version 3?
[22:43:22] :)
[22:44:21] operations: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1590944 (Dzahn) p:Normal>Low
[22:44:49] nope, he just likes what I've told him about it :)
[22:44:59] !log disabled puppet on elastic hosts temporarily to safely roll out fw change. elastic seems to have not taken it well and I'm holding for green cluster state.
[22:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:45:05] Krenair: i also want that list, thanks!
[22:46:10] (PS1) Andrew Bogott: Copy over ca-certificates to labs base image [puppet] - https://gerrit.wikimedia.org/r/235142 (https://phabricator.wikimedia.org/T110891)
[22:46:21] there is way to much support talk about "HyperKittie" in #mailman :p
[22:46:25] re: mailman3
[22:46:45] it's the new archive thing
[22:47:05] (CR) Andrew Bogott: [C: 2] Copy over ca-certificates to labs base image [puppet] - https://gerrit.wikimedia.org/r/235142 (https://phabricator.wikimedia.org/T110891) (owner: Andrew Bogott)
[22:47:13] i distrust anything that has to use a cat in the name to appeal to users ;]
[22:47:41] "Django application"
[22:48:39] RECOVERY - Apache HTTP on mw2187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.194 second response time
[22:48:59] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 70705 bytes in 5.717 second response time
[22:49:05] operations, Wikimedia-Mailing-lists: import old staff list archives ?
- https://phabricator.wikimedia.org/T109395#1590951 (Dzahn) Open>stalled
[22:52:13] operations, Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1590963 (JohnLewis) NEW a:Tfinc
[23:00:05] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150831T2300). Please do the needful.
[23:00:05] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:08:02] operations, Wikimedia-Mailing-lists, Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1591088 (JohnLewis) NEW a:Dzahn
[23:12:14] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[23:14:06] operations, Wikimedia-Mailing-lists, Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1591095 (Dzahn) 16:14 -!- Irssi: Join to #wiktionary was synced in 1 secs 16:14 < mutante> hey 16:14 < mutante> would like to assign a new list admin for wiktionary-l 16:14...
[23:16:04] (PS1) BBlack: vcl: merge cluster_options into vcl_config, refactor [puppet] - https://gerrit.wikimedia.org/r/235145 (https://phabricator.wikimedia.org/T96847)
[23:16:06] (PS1) BBlack: standardize hiera-overridable class/config params [puppet] - https://gerrit.wikimedia.org/r/235146 (https://phabricator.wikimedia.org/T96847)
[23:16:08] (PS1) BBlack: mobile: add bits compat code [puppet] - https://gerrit.wikimedia.org/r/235147 (https://phabricator.wikimedia.org/T109286)
[23:16:10] (PS1) BBlack: clean up text/mobile whitespace-only diffs [puppet] - https://gerrit.wikimedia.org/r/235148 (https://phabricator.wikimedia.org/T109286)
[23:16:32] operations, Wikimedia-Mailing-lists, Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1591107 (Dzahn) https://gerrit.wikimedia.org/r/#/c/234702/ https://gerrit.wikimedia.org/r/#/c/234708/
[23:16:50] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1591110 (Dzahn)
[23:16:52] operations, Wikimedia-Mailing-lists, Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1591108 (Dzahn) Open>Resolved Apache issues fixed. Monitoring issues fixed.
[23:17:23] operations, Wikimedia-Mailing-lists: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1591112 (Dzahn)
[23:17:33] (CR) EBernhardson: [C: 2] Configure titlesuggest index for dewiki and enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/235055 (https://phabricator.wikimedia.org/T110922) (owner: EBernhardson)
[23:17:58] (Merged) jenkins-bot: Configure titlesuggest index for dewiki and enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/235055 (https://phabricator.wikimedia.org/T110922) (owner: EBernhardson)
[23:21:41] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: update config of cirrussearch experimental suggestions api (duration: 00m 12s)
[23:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:19] operations, Wikimedia-Mailing-lists: wikimediave-l: decide status of list - https://phabricator.wikimedia.org/T110974#1591149 (JohnLewis) NEW a:Dzahn
[23:24:20] operations, Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1591156 (Dzahn) need to get rsyncd back up manually because we need to rsync but that was added via puppet in the migration role.
now we apply the prod role
[23:24:25] (PS1) EBernhardson: Revert "Configure titlesuggest index for dewiki and enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/235150
[23:24:34] (CR) EBernhardson: [C: 2] Revert "Configure titlesuggest index for dewiki and enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/235150 (owner: EBernhardson)
[23:24:40] (Merged) jenkins-bot: Revert "Configure titlesuggest index for dewiki and enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/235150 (owner: EBernhardson)
[23:25:17] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: revert update for cirrussearch experimental suggestions api (duration: 00m 12s)
[23:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:33] operations, Wikimedia-Mailing-lists: wikimedia-ke: close list - https://phabricator.wikimedia.org/T110975#1591158 (JohnLewis) NEW a:Dzahn
[23:27:31] operations, Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1591165 (Dzahn)
[23:27:56] !log ebernhardson@tin Synchronized php-1.26wmf20/extensions/CirrusSearch/: (no message) (duration: 00m 12s)
[23:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:28:29] operations, Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1591149 (Dzahn) The last valid message in archives is from May 2015.
[23:31:03] operations, Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1591177 (JohnLewis) Communicating with 'education' list admins to discuss spam practices. List is active and admins are current WMF employees. Just noting.
[23:31:45] operations, Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1591180 (Dzahn) a:Dzahn>None
[23:34:48] operations, Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1591184 (chasemp) Regarding the elastic1001 problem (where we have concerns about instability on the master). We discussed "preseeding" the ferm stuff to ensure a full config. I created: ```aptitude in...
[23:37:10] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1591196 (Dzahn) I ran our "disable_list.sh" script on this. It finished with: "wikiru-l disabled. Archives should be available at current location, all mail should be moderated and the list should not be on...
[23:38:04] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[23:38:06] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1591201 (Dzahn) p:High>Low
[23:38:39] operations, Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1591207 (Dzahn)
[23:38:41] operations, Wikimedia-Mailing-lists: Disable wikiru-l - https://phabricator.wikimedia.org/T110957#1591205 (Dzahn) Open>Resolved feel free to reopen if we want anything else to change, like new admins or something
[23:39:10] (PS1) Dzahn: mailman: ferm, allow rsync from sodium for migration [puppet] - https://gerrit.wikimedia.org/r/235155 (https://phabricator.wikimedia.org/T110129)
[23:39:22] (PS5) Rush: Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 (https://phabricator.wikimedia.org/T110635) (owner: JanZerebecki)
[23:40:32] !log ori@tin Synchronized php-1.26wmf20/extensions/EducationProgram: 97ab82eab2: Updated mediawiki/core Project: mediawiki/extensions/EducationProgram
85a7d3932c1a4ad28f1a8dd05704f4e524152349 (duration: 00m 14s)
[23:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:43:04] (CR) John F. Lewis: [C: 1] mailman: ferm, allow rsync from sodium for migration [puppet] - https://gerrit.wikimedia.org/r/235155 (https://phabricator.wikimedia.org/T110129) (owner: Dzahn)
[23:43:54] (CR) Rush: [C: 2] Make search on wikitech work again [puppet] - https://gerrit.wikimedia.org/r/235048 (https://phabricator.wikimedia.org/T110635) (owner: JanZerebecki)
[23:47:31] (PS1) EBernhardson: Revert "Revert "Configure titlesuggest index for dewiki and enwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/235156
[23:47:38] (CR) EBernhardson: [C: 2] Revert "Revert "Configure titlesuggest index for dewiki and enwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/235156 (owner: EBernhardson)
[23:47:46] (Merged) jenkins-bot: Revert "Revert "Configure titlesuggest index for dewiki and enwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/235156 (owner: EBernhardson)
[23:48:56] revert revert revert
[23:49:18] turned out problem was pre-existing :P
[23:49:23] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: reenable config changes for cirrus experimental completion api (duration: 00m 12s)
[23:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:49:33] started on the 28th
[23:54:58] (CR) Alex Monk: [C: 2] Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[23:55:25] (Merged) jenkins-bot: Use CodeEditor for HTML templates on Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/233665 (https://phabricator.wikimedia.org/T110151) (owner: Legoktm)
[23:56:50] !log krenair@tin Synchronized wmf-config/CommonSettings.php:
https://gerrit.wikimedia.org/r/#/c/233665/ (duration: 00m 11s)
[23:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master