[00:00:59] (03CR) 10Dzahn: "it's that passwords::mysql::phabricator "mysql_appuser" and "mysql_apppass" are actually just "appuser" and "apppass" without the mysql_ p" [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [00:01:23] legoktm: Signed and uploaded. [00:01:37] (03CR) 10Platonides: ":NickServ!NickServ@services. NOTICE %nick% :You are now identified for %account%." [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [00:01:51] thank you :)! [00:01:58] my key is published at https://www.mediawiki.org/keys/keys.html [00:01:58] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: puppet fail [00:02:05] ostriches i think swat is finished according to https://wikitech.wikimedia.org/wiki/Deployments [00:02:24] I think it's finished when RoanKattouw tells me it's finished. [00:02:35] oh [00:02:51] I'm not done yet [00:02:58] Only the Echo+Flow patches and then I'll be done [00:04:59] !log catrope@tin Synchronized php-1.28.0-wmf.17/extensions/Echo: SWAT (duration: 00m 53s) [00:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:09] !log catrope@tin Synchronized php-1.28.0-wmf.17/extensions/Flow: SWAT (duration: 01m 09s) [00:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:11:00] (03PS4) 10Dzahn: Phab: Remove config abstraction. Useless & confusing [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [00:14:12] (03PS5) 10Dzahn: Phab: Remove config abstraction. Useless & confusing [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [00:18:20] RoanKattouw have you finished swat? [00:18:26] Yes, sorry [00:18:35] * RoanKattouw promises to do better next time and not get distracted so much [00:18:40] ostriches: All yours [00:18:42] Ok thanks [00:19:45] Hmmm, ok now to figure out what's up with systemd [00:19:49] And why it don't like me [00:22:40] Hmmm, gerrit was running from something that wasn't systemd/puppet, that's fun. [00:23:07] lol [00:23:15] Ok fixed. [00:23:19] Whatever, I'm done for the day. [00:23:43] ostriches thank you [00:24:18] hmm, ok, thanks [00:25:12] paladox' fix works, links to phab comments un-broken :) [00:25:23] Yep :) [00:25:27] nice [00:29:08] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:30:41] Yep finally closed the task now [00:30:44] :) [00:34:42] RoanKattouw: jenkins no workie :'( [00:34:59] AaronSchulz: Link/context? [00:35:43] bearloga: yes, you can use hadoop without having access to private data like webrequest logs [00:35:51] if you are not in the analytics-privatedata-users group [00:35:55] you can't read the webrequest logs [00:36:23] RoanKattouw: barfage at https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/73961/console plus random rake failures [00:36:43] @ottomata: Cool! Thank you!
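The group gating ottomata describes is a plain permission check on the Hadoop side. A minimal sketch of how it surfaces in practice, assuming the standard Hadoop client tools; the HDFS path here is illustrative, not necessarily the production location:

```
# Are we in the group that grants read access to the raw logs?
id | grep -o analytics-privatedata-users
# Without it, listing the (illustrative) webrequest location fails
# with "Permission denied":
hdfs dfs -ls /wmf/data/raw/webrequest
```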
[00:37:21] AaronSchulz: Jenkins failed to download from Gerrit because Gerrit was being restarted just then [00:37:27] A recheck should fix it [00:38:26] the random rake failure is not due to the restart [00:38:53] it's https://phabricator.wikimedia.org/T143233 [00:39:03] but either way not related to your actual change [00:41:17] grrrit-wm: off and on again [00:42:42] (03CR) 10Dzahn: "done separately because i'm not sure about the others yet https://gerrit.wikimedia.org/r/#/c/307665/1" [puppet] - 10https://gerrit.wikimedia.org/r/306501 (owner: 10Dzahn) [00:53:04] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2596548 (10Dzahn) @Andrew i see https://wikitech.wikimedia.org/wiki/User:VolkerE but i also don't find it in LDAP. is it wiki user name vs. shell user name again? i think so [00:57:46] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2596554 (10Dzahn) yes, it is. VolkerE is the "sn". this is how i found it now with ldapsearch instead of ldaplist ``` [terbium:~] $ ldapsearch -x sn=VolkerE | grep uid dn: uid=volker-e... [00:58:27] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2596559 (10Dzahn) so it's underscore vs. dash vs. camel case [terbium:~] $ ldaplist -l passwd volker-e [01:07:17] (03PS1) 10Dzahn: admin: create shell account for Volker E. [puppet] - 10https://gerrit.wikimedia.org/r/307667 (https://phabricator.wikimedia.org/T143465) [01:08:42] (03PS2) 10Dzahn: admin: create shell account for Volker E. [puppet] - 10https://gerrit.wikimedia.org/r/307667 (https://phabricator.wikimedia.org/T143465) [01:09:00] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2596573 (10Dzahn) @Volker_E just one last thing, could you sign L3?
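A minimal sketch of the lookup pattern Dzahn describes above, assuming the standard OpenLDAP client tools; the base DN is an illustrative placeholder rather than the production value:

```
# Search by surname attribute to recover the actual uid (anonymous bind):
ldapsearch -x -b "ou=people,dc=wikimedia,dc=org" "(sn=VolkerE)" uid cn
# Cross-check the shell account name against the local NSS view:
getent passwd volker-e
```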
[01:14:10] (03PS1) 10Madhuvishy: Specify minute param for git pull cdnjs cron [puppet] - 10https://gerrit.wikimedia.org/r/307668 [01:16:49] (03CR) 10jenkins-bot: [V: 04-1] Specify minute param for git pull cdnjs cron [puppet] - 10https://gerrit.wikimedia.org/r/307668 (owner: 10Madhuvishy) [01:17:53] (03CR) 10Madhuvishy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/307668 (owner: 10Madhuvishy) [01:17:58] wtf jenkins [01:19:13] (03CR) 10Madhuvishy: [C: 032 V: 032] Specify minute param for git pull cdnjs cron [puppet] - 10https://gerrit.wikimedia.org/r/307668 (owner: 10Madhuvishy) [01:23:34] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1401.33 seconds [01:35:07] madhuvishy: the random jenkins thing is https://phabricator.wikimedia.org/T144325 and it looks like it's the fault of fastly [01:35:16] you're not the only one today [01:42:12] (03PS1) 10Andrew Bogott: Labs firstboot.sh: Use the new project_id setting [puppet] - 10https://gerrit.wikimedia.org/r/307671 (https://phabricator.wikimedia.org/T105891) [01:46:07] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 0.60 seconds [01:47:08] (03CR) 10Alex Monk: [C: 031] "Seems to work" [puppet] - 10https://gerrit.wikimedia.org/r/307671 (https://phabricator.wikimedia.org/T105891) (owner: 10Andrew Bogott) [01:48:54] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Puppet has 1 failures [01:51:05] (03CR) 10Andrew Bogott: [C: 032] Labs firstboot.sh: Use the new project_id setting [puppet] - 10https://gerrit.wikimedia.org/r/307671 (https://phabricator.wikimedia.org/T105891) (owner: 10Andrew Bogott) [01:58:50] ostriches: Did the Gerrit->Phab bot break? [01:59:00] https://gerrit.wikimedia.org/r/307672 but https://phabricator.wikimedia.org/T138356 [01:59:15] ostriches: Ugh, I fatfingered it, never mind [02:13:56] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:25:45] 06Operations, 10Analytics, 07LDAP: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596648 (10Peachey88) [02:34:50] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 12m 45s) [02:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:03] 06Operations, 10Analytics: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596656 (10Tbayer) [02:37:57] 06Operations, 10Analytics: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596371 (10Tbayer) @Peachey88 : As explained in the task description, this issue is specifically *not* about LDAP. 
[02:40:14] (03PS3) 10BryanDavis: Wait for ack of identify from NickServ before joining channels [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) [02:57:14] (03CR) 10BryanDavis: [V: 031] "Tested via cherry-pick on tool labs" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [02:59:50] jouncebot: next [02:59:51] In 10 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T1300) [03:10:24] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.17) (duration: 19m 02s) [03:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:22] !log krinkle@tin Synchronized php-1.28.0-wmf.17/resources/src/mediawiki/mediawiki.js: Ie4e7464aa811 and I90eea4bfe (duration: 00m 47s) [03:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:17:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 31 03:17:37 UTC 2016 (duration 7m 13s) [03:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:35:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.21 seconds [03:48:49] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 173.27 seconds [04:27:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 209, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [04:33:21] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [04:34:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [04:40:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0 [05:20:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [05:22:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4997520 keys - replication_delay is 0 [05:30:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 209, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR [05:35:34] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0 [05:57:24] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures [06:11:07] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:16:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [06:24:47] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is 
currently enabled, last run 2 minutes ago with 0 failures [06:31:12] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [06:32:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [07:00:22] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [07:01:56] (03PS4) 10Jcrespo: Remove db1027 from internal dns entries [dns] - 10https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253) [07:05:17] !log reimaging mw2104-mw2108 to jessie [07:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:07:38] !log removed db1027 from our dns [07:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:10:48] !log removed cached files on stat1001 (/var/www/limn-public-data/caching) after checking with joal, disk space on root partition was depleted [07:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:11:42] RECOVERY - Disk space on stat1001 is OK: DISK OK [07:21:58] PROBLEM - Disk space on stat1001 is CRITICAL: DISK CRITICAL - free space: / 50 MB (0% inode=93%) [07:25:48] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:33:20] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.80 seconds [07:41:53] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597272 (10ph_singer) @Cmjohnson Thanks a lot! Unfortunately, we have some trouble accessing the host. ``` ssh -vv psinger@stat1002.eqia... [07:45:51] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.31 seconds [07:47:01] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:47:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [07:50:21] 06Operations, 10ops-eqiad, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685#2597361 (10Gehel) elastic1046 is still documented as in rack D4 in racktables. @Cmjohnson I assume t...
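The stat1001 messages above are the classic full-root-partition cycle: cleanup, brief recovery, full again. A hedged sketch of the usual triage, assuming GNU coreutils; -x keeps du on the root filesystem so large separate mounts don't skew the numbers:

```
df -h /                                                      # how bad is it?
du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20   # what is eating it?
```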
[07:52:00] (03PS1) 10Gehel: elastic104[4567] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307700 (https://phabricator.wikimedia.org/T143685) [07:52:21] RECOVERY - Disk space on stat1001 is OK: DISK OK [07:52:37] moritzm: --^ [07:53:32] yep, saw that in -analytics :-) [07:53:53] (03CR) 10Hashar: [C: 031] contint: fix resource conflict with service::deploy::common [puppet] - 10https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: 10BryanDavis) [07:55:54] I think I am the one to blame, I have probably messed up with rsync during the reimage [07:55:57] months ago [07:56:03] otherwise I can't explain this issue [07:56:04] sorry :( [08:05:36] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2597465 (10SindyM3) >>! In T118388#2595125, @Multichill wrote: >>>! In T118388#2594683, @Dzahn wrote: >> @SindyM3 I think this should probably have a new tick... [08:05:57] (03PS1) 10Elukey: Clean up unused AQS to Hadoop firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/307703 (https://phabricator.wikimedia.org/T138609) [08:12:48] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2597481 (10SindyM3) @Dzahn I'm sorry for not creating a new ticket. The admin of schippers.wikimedia.nl is Lemonbit and they sent me this email. So I'm lookin... [08:18:23] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:23] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:23] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:26] phabricator is down? [08:18:34] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:35] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:42] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:42] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:42] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:47] https://phabricator.wikimedia.org/T144341 Request from 71.202.162.80 via cp4002 cp4002, Varnish XID 131467205 Error: 503, Backend fetch failed at Wed, 31 Aug 2016 08:18:28 GMT [08:18:48] hmm [08:18:52] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:55] wikis are down for me [08:18:56] getting an error when trying to search at enwp [08:19:11] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:12] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:12] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:12] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:24] elukey, moritzm: ^^ [08:19:31] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:31] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:31] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:31] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:31] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:32] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:32] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:37] yes it seems that ulsfo is down [08:19:38] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5
2620:0:863:ed1a::3:d [08:20:00] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:20:02] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:20:02] (03CR) 10DCausse: [C: 031] Report partial result from mwgrep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/307652 (https://phabricator.wikimedia.org/T127788) (owner: 10EBernhardson) [08:20:03] ema you there? [08:20:16] Lots of ISPs are being DDoSed right now [08:20:33] (not related to here) [08:20:41] uh oh [08:20:42] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [08:20:49] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [08:20:55] http://venturebeat.com/2016/08/30/cyberattackers-spoil-world-of-warcraft-legion-launch-with-ddos-attack/ [08:20:55] PROBLEM - Host maps-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:d [08:20:57] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:57] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:57] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:57] PROBLEM - Host cr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:20:57] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:21:03] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:21:03] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:21:06] preparing a patch to depool ulsfo [08:21:10] godog: should we remove ulsfo from gdns? [08:21:13] super [08:21:15] thanks :) [08:21:35] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [08:21:36] PROBLEM - Host asw-ulsfo.mgmt.ulsfo.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [08:21:36] PROBLEM - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:21:43] PROBLEM - Host maps-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:21:43] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:22:26] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [08:22:26] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 74.65 ms [08:22:27] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 74.95 ms [08:22:27] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 75.97 ms [08:22:27] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [08:22:27] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 75.23 ms [08:22:27] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [08:22:28] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.52 ms [08:22:28] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 76.26 ms [08:22:29] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.27 ms [08:22:29] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 76.06 ms [08:22:30] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 75.24 ms [08:22:30] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 75.89 ms [08:22:31] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [08:22:42] back for me now... 
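One way to watch a drain like the one being prepared take effect from the outside is to query the authoritative servers directly; a sketch assuming dig from the usual dnsutils package (the names are taken from the log, and cached answers will lag by up to the record TTL):

```
# Which edge the GeoDNS authoritative currently hands out for a user-facing name:
dig +short en.wikipedia.org @ns0.wikimedia.org
# The site-specific service name keeps resolving; what changes after a drain
# is that user-facing names stop mapping to it:
dig +short text-lb.ulsfo.wikimedia.org
```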
[08:23:03] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 74.64 ms [08:23:23] Oh whew. :P [08:23:33] RECOVERY - Host maps-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.75 ms [08:23:33] (03PS1) 10Alexandros Kosiaris: ulsfo: Drain it, there are issues [dns] - 10https://gerrit.wikimedia.org/r/307706 [08:23:42] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.95 ms [08:23:43] legoktm: Did you know that for the above link? [08:23:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [08:23:46] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [08:23:46] RECOVERY - Host asw-ulsfo.mgmt.ulsfo.wmnet is UP: PING OK - Packet loss = 0%, RTA = 75.39 ms [08:24:00] PROBLEM - Disk space on stat1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%) [08:24:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:24:06] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.95 ms [08:24:15] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.41 ms [08:24:20] (03CR) 10Alexandros Kosiaris: [C: 032] ulsfo: Drain it, there are issues [dns] - 10https://gerrit.wikimedia.org/r/307706 (owner: 10Alexandros Kosiaris) [08:24:22] RECOVERY - Host maps-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [08:24:22] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.42 ms [08:24:53] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.55 ms [08:24:56] !log drained ulsfo of traffic, it seems to be unstable currently [08:24:59] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.43 ms [08:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:12] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.55 ms [08:25:21] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.89 seconds [08:26:33] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:45] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:45] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:46] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:46] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [08:26:59] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.08 ms [08:26:59] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.49 ms [08:26:59] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 74.89 ms [08:27:03] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 74.84 ms [08:27:09] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [08:28:29] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [08:28:30] PROBLEM - puppet last run on cp4017 is 
CRITICAL: CRITICAL: puppet fail [08:28:39] PROBLEM - PyBal backends health check on lvs4002 is CRITICAL: PYBAL CRITICAL - uploadlb_80 - Could not depool server cp4005.ulsfo.wmnet because of too many down!: misc_weblb_80 - Could not depool server cp4001.ulsfo.wmnet because of too many down!: mapslb6_80 - Could not depool server cp4012.ulsfo.wmnet because of too many down!: mapslb_80 - Could not depool server cp4020.ulsfo.wmnet because of too many down!: uploadlb6_80 - Could not [08:28:39] PROBLEM - PyBal backends health check on lvs4001 is CRITICAL: PYBAL CRITICAL - geoiplookuplb_80 - Could not depool server cp4009.ulsfo.wmnet because of too many down!: textlb_80 - Could not depool server cp4017.ulsfo.wmnet because of too many down!: geoiplookuplb_443 - Could not depool server cp4016.ulsfo.wmnet because of too many down! [08:29:38] (03CR) 10Filippo Giunchedi: [C: 031] partman/cassandra: rename recipes, delete unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/307665 (owner: 10Dzahn) [08:31:14] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/3899/" [puppet] - 10https://gerrit.wikimedia.org/r/307703 (https://phabricator.wikimedia.org/T138609) (owner: 10Elukey) [08:31:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:31:46] PROBLEM - LVS HTTP IPv6 on misc-web-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [08:31:47] PROBLEM - IPsec on cp4005 is CRITICAL: Timeout while attempting connection [08:31:47] PROBLEM - Varnish traffic logger - varnishxcps on cp4008 is CRITICAL: Timeout while attempting connection [08:31:47] PROBLEM - IPsec on cp4011 is CRITICAL: Timeout while attempting connection [08:31:53] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:32:20] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:29] PROBLEM - puppet last run on lvs4002 is CRITICAL: Timeout while attempting connection [08:32:30] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:40] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:32:40] PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.128.128.6 [08:32:40] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:40] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:40] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:49] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:05] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:33:09] I'm silencing at least lvs for a bit [08:33:19] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:19] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:19] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:19] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:35] PROBLEM - LVS HTTP IPv6 on maps-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host [08:33:43] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:33:43] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:43] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:43] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:43] PROBLEM - Host cp4017 is DOWN: PING 
CRITICAL - Packet loss = 100% [08:34:01] PROBLEM - SSH on lvs4002 is CRITICAL: Connection timed out [08:34:08] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:08] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:08] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:08] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:24] PROBLEM - Host maps-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:d [08:34:29] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [08:34:32] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:32] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:32] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:38] PROBLEM - Host maps-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:34:38] PROBLEM - Host cr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:34:38] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:38] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:35:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [1000.0] [08:35:15] 06Operations, 10Traffic: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2597553 (10Danielsberger) I have been working for some time with Ramesh on cache admission policies for Akamai workloads, and I can contribute the following takeaways # frequency-based admission -... [08:35:31] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [08:35:34] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::6 [08:35:43] PROBLEM - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:35:43] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:37:31] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:58] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::3:d [08:38:17] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [08:38:53] PROBLEM - Host asw-ulsfo.mgmt.ulsfo.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [08:40:33] PROBLEM - Host cr1-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::1 [08:40:33] PROBLEM - Host cr2-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::2 [08:40:33] godog: good morning! thx for the puppet swat merges yesterday. Can we get to update jenkins-debian-glue deb package for Jessie? 
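For context, publishing a rebuilt package like the one hashar is asking for typically happens through reprepro on the apt host; a sketch with an illustrative base directory and .changes filename:

```
# include imports the .deb(s) plus metadata referenced by the .changes file:
reprepro -b /srv/wikimedia include jessie-wikimedia jenkins-debian-glue_0.17.0_amd64.changes
# Verify what the distribution now carries:
reprepro -b /srv/wikimedia list jessie-wikimedia jenkins-debian-glue
```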
[08:40:34] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 93%, RTA = 74.48 ms [08:40:34] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 50%, RTA = 74.91 ms [08:40:34] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 44%, RTA = 74.63 ms [08:40:34] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 16%, RTA = 75.22 ms [08:40:34] RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 64%, RTA = 74.78 ms [08:40:43] task being https://phabricator.wikimedia.org/T141114 with all relevant links [08:40:47] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 37%, RTA = 74.85 ms [08:40:47] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 74.91 ms [08:40:48] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 75.45 ms [08:40:48] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 75.63 ms [08:40:48] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 74.60 ms [08:40:48] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 75.29 ms [08:40:48] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [08:40:49] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [08:40:49] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.97 ms [08:40:50] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.69 ms [08:40:50] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.11 ms [08:40:51] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.41 ms [08:40:51] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 74.79 ms [08:40:52] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 74.57 ms [08:41:10] RECOVERY - Host maps-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.80 ms [08:41:16] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 78.01 ms [08:41:36] RECOVERY - Host maps-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.88 ms [08:41:36] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 77.65 ms [08:41:36] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms [08:41:47] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.61 ms [08:42:02] RECOVERY - LVS HTTP IPv6 on maps-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 390 bytes in 0.153 second response time [08:42:12] RECOVERY - LVS HTTP IPv6 on misc-web-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.150 second response time [08:42:22] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 74.61 ms [08:42:26] RECOVERY - SSH on lvs4002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [08:42:40] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [08:42:40] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [08:42:41] RECOVERY - Varnish traffic logger - varnishxcps on cp4008 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [08:42:59] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.23 ms [08:42:59] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 77.76 ms [08:43:18] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.56 ms [08:43:39] RECOVERY - PyBal backends health check on lvs4001 is OK: PYBAL OK - All pools are healthy [08:43:45] RECOVERY 
- Host misc-web-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.73 ms [08:43:54] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.38 ms [08:46:54] hashar: yeah I'll take a look [08:46:58] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail [08:47:09] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: puppet fail [08:47:18] RECOVERY - Host cr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 77.20 ms [08:47:18] RECOVERY - Host cr2-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.84 ms [08:47:39] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [08:47:51] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [08:47:58] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: puppet fail [08:47:59] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 3 failures [08:47:59] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [08:47:59] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [08:48:50] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [08:48:50] PROBLEM - puppet last run on lvs4004 is CRITICAL: Timeout while attempting connection [08:48:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:48:59] PROBLEM - puppet last run on cp4008 is CRITICAL: Timeout while attempting connection [08:48:59] PROBLEM - puppet last run on cp4020 is CRITICAL: Timeout while attempting connection [08:49:14] godog: it is solely for the CI Jessie slaves which all have the package manually updated already [08:49:29] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:29] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:29] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:39] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:39] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:48] PROBLEM - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [08:49:51] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:52] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:52] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:52] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:52] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:50:10] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms [08:50:21] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 74.74 ms [08:50:21] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.92 ms [08:50:21] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 75.35 ms [08:50:21] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.58 ms [08:50:21] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 76.52 ms [08:50:22] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 75.31 ms [08:50:22] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.28 ms [08:50:23] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 74.56 ms [08:50:23] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.29 ms [08:50:58] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, 
RTA = 76.90 ms [08:51:14] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:51:43] hashar: ack, thanks [08:52:19] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:53:52] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:52] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:59] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:00] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:01] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:19] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:19] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:39] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:51] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:58] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 61%, RTA = 83.29 ms [08:54:59] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 74.77 ms [08:55:10] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 74.80 ms [08:55:10] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 74.72 ms [08:55:10] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.91 ms [08:55:10] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 75.32 ms [08:55:10] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 75.26 ms [08:55:10] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 74.60 ms [08:55:11] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 75.28 ms [08:55:11] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 74.56 ms [08:55:12] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 74.88 ms [08:55:12] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 75.93 ms [08:55:13] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.68 ms [08:55:19] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 75.89 ms [08:55:48] PROBLEM - PyBal backends health check on lvs4004 is CRITICAL: PYBAL CRITICAL - uploadlb_80 - Could not depool server cp4006.ulsfo.wmnet because of too many down!: misc_weblb_80 - Could not depool server cp4002.ulsfo.wmnet because of too many down!: uploadlb6_80 - Could not depool server cp4006.ulsfo.wmnet because of too many down!: misc_weblb6_80 - Could not depool server cp4002.ulsfo.wmnet because of too many down!: misc_weblb_443 - [08:57:44] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:04] sigh, sorry about the page [08:58:29] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:29] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:29] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:29] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:44] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:45] PROBLEM 
- Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:48] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:48] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:09] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:09] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:18] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:18] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:18] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:18] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:39] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:39] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:40] that shouldn't have paged though, downtimed in icinga heh [08:59:48] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 50%, RTA = 74.85 ms [08:59:49] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 75.43 ms [08:59:50] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.84 ms [08:59:50] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [08:59:50] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 74.82 ms [08:59:50] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 75.98 ms [08:59:50] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 75.36 ms [08:59:51] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 76.67 ms [08:59:51] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms [08:59:52] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.90 ms [08:59:52] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 76.27 ms [08:59:53] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 75.66 ms [08:59:53] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.57 ms [08:59:54] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [08:59:54] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.62 ms [09:00:01] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.52 ms [09:00:14] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.02 ms [09:01:52] RECOVERY - PyBal backends health check on lvs4004 is OK: PYBAL OK - All pools are healthy [09:03:12] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:03:42] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:04:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:04:41] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:05:01] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:05:13] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:06:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [1000.0] [09:06:22] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:06:32] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is 
currently enabled, last run 14 seconds ago with 0 failures [09:07:03] hashar: I take it jenkins-debian-glue 0.11.0 you won't need anymore? [09:07:43] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:09:03] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:09:04] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:09:24] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597635 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi this is completed, package uploaded ``` # reprepro list jessie-wikimedia... [09:10:41] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:11:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:13:52] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:14:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:15:20] (03PS2) 10Filippo Giunchedi: Revert "Revert "Change-Prop: Rerender summary on wikidata item update"" [puppet] - 10https://gerrit.wikimedia.org/r/307641 (owner: 10Mobrovac) [09:17:56] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Revert "Change-Prop: Rerender summary on wikidata item update"" [puppet] - 10https://gerrit.wikimedia.org/r/307641 (owner: 10Mobrovac) [09:18:11] (03CR) 10Alexandros Kosiaris: [C: 031] Clean up unused AQS to Hadoop firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/307703 (https://phabricator.wikimedia.org/T138609) (owner: 10Elukey) [09:18:17] 06Operations, 10netops, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597652 (10akosiaris) Given the patch set above, I 've removed the rule for port 7000. I 've LGTMed the patch as well.Thanks! [09:18:25] mobrovac: ^ merged [09:18:34] grazie godog aka filippo [09:18:40] i'll run puppet and restart [09:18:56] np mobrovac ! [09:19:10] 06Operations, 10netops, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597653 (10elukey) thanks a lot and sorry for the delay! 
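The "run puppet and restart" step mobrovac mentions is the usual two-liner on the target host; a sketch, with the changeprop unit name being an assumption about how the service is installed:

```
sudo puppet agent --test           # pull and apply the just-merged config
sudo systemctl restart changeprop  # assumed systemd unit name for Change-Prop
```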
[09:19:27] (03PS2) 10Elukey: Clean up unused AQS to Hadoop firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/307703 (https://phabricator.wikimedia.org/T138609) [09:19:42] RECOVERY - Disk space on stat1001 is OK: DISK OK [09:21:12] (03PS2) 10Filippo Giunchedi: puppet_compiler: add script to update facts [puppet] - 10https://gerrit.wikimedia.org/r/306890 [09:21:29] (03CR) 10Elukey: [C: 032] Clean up unused AQS to Hadoop firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/307703 (https://phabricator.wikimedia.org/T138609) (owner: 10Elukey) [09:22:23] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 8 failures [09:22:43] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 8 failures [09:22:50] (03CR) 10Filippo Giunchedi: [C: 032] puppet_compiler: add script to update facts [puppet] - 10https://gerrit.wikimedia.org/r/306890 (owner: 10Filippo Giunchedi) [09:22:56] (03PS3) 10Filippo Giunchedi: puppet_compiler: add script to update facts [puppet] - 10https://gerrit.wikimedia.org/r/306890 [09:23:03] PROBLEM - puppet last run on mw2105 is CRITICAL: Timeout while attempting connection [09:24:48] 06Operations, 05Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#2597663 (10jcrespo) [09:26:35] 06Operations, 10netops, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597664 (10akosiaris) 05Open>03Resolved Resolving then. Thanks! [09:32:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comments, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [09:33:02] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:33:22] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:32] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:12] (03PS7) 10Filippo Giunchedi: base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) [09:34:32] (03CR) 10Filippo Giunchedi: base: support for multiple syslog hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [09:34:49] akosiaris: thanks for ^ ! [09:35:28] (03PS8) 10Filippo Giunchedi: base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) [09:35:48] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597676 (10hashar) Confirmed. Since 0.17.0 got installed to `main` can you drop the old version from `thirdparty` please? That is to... [09:36:05] godog: thank you! can you also cleanup the old jenkins-debian-glue 0.11.0 from jessie-wikimedia/thirdparty please? 
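The cleanup hashar is asking for is a one-liner on the apt host; a sketch with an illustrative base directory, using reprepro's -C flag to confine the removal to the thirdparty component:

```
reprepro -b /srv/wikimedia -C thirdparty remove jessie-wikimedia jenkins-debian-glue
# Confirm nothing stale remains:
reprepro -b /srv/wikimedia list jessie-wikimedia jenkins-debian-glue
```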
[09:37:03] hashar: yep, {{done}} [09:37:16] (03CR) 10Filippo Giunchedi: [C: 032] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [09:42:57] !log reimaging mw2108-mw2111 to jessie [09:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:43:21] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: puppet fail [09:43:23] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: puppet fail [09:44:03] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: puppet fail [09:44:51] those puppet failures should recover by themselves btw, I've seen it happening where the master isn't synced yet [09:44:53] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail [09:45:04] case in point, Error: Could not retrieve catalog from remote server: Error 400 on SERVER: ::base::remote_syslog::central_host required at /etc/puppet/modules/base/manifests/remote_syslog.pp:21 on node ms-fe1001.eqiad.wmnet [09:45:10] and then recovers [09:45:21] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597698 (10hashar) Sorry, I forgot about a bunch of other ones: | jenkins-debian-glue-buildenv-slave | 0.13.0 | http://apt.wikim... [09:45:54] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:46:09] godog: sorry, I forgot about a bunch of other legacy .deb packages related to jenkins-debian-glue . I have listed them all at https://phabricator.wikimedia.org/T141114#2597698 :((( [09:47:12] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2597699 (10Gehel) [09:47:45] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wdqs200[12] - https://phabricator.wikimedia.org/T142864#2549225 (10Gehel) 05Open>03Resolved Closing this task, further configuration is tracked at T144380. [09:48:24] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597726 (10Paladox) Also it seems the packages it depends on do not exist in the Debian package manager, causing installs to fail. [09:50:01] hashar: so those need updating or deleting? [09:51:20] godog: delete [09:51:31] !log restarting cassandra on aqs100[123] to verify performance after daemon restart [09:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:50] godog: previous versions produced a bunch of binary packages. The new one, 0.17.0, only produces two binary packages: jenkins-debian-glue and jenkins-debian-glue-buildenv. All the others have been phased out [09:52:03] phew, yeah that's much saner [09:52:38] and easier to manage :} [09:53:24] !log change-prop deploying e999f517 [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:20] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597732 (10hashar) @paladox what do you mean? Do you have some kind of trace? On Jessie `apt-get update && apt-get install jenkins-deb...
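For context on the multiple-syslog-hosts change merged above: in plain rsyslog terms it amounts to repeating the forwarding action once per central host. A sketch of the generated config, not the actual puppet template; the eqiad host name is an assumption, @ forwards over UDP and @@ would use TCP:

```
# /etc/rsyslog.d/30-remote-syslog.conf (illustrative path)
*.info @syslog.eqiad.wmnet:514
*.info @syslog.codfw.wmnet:514
```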
[09:55:43] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597733 (10Paladox) >>! In T141114#2597732, @hashar wrote: > @paladox what do you mean? Do you have some kind of trace? > > On Jessie... [09:57:08] hashar: all done, can you double check? [10:01:27] godog: there is still jenkins-debian-glue-buildenv | 0.11.0 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty [10:01:34] sorry :( [10:01:47] that is nitpicking really. But I would rather avoid a version mismatch later on [10:02:20] hashar: aye no it is better to cleanup [10:03:42] (meanwhile on rspec planet: role::mediawiki::appserver "should compile into a catalogue without dependency cycles" OK ) [10:07:15] (03PS1) 10Elukey: Add analytics-admins/roots to the Druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/307715 [10:10:26] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:48] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:10:56] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:13:28] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597764 (10hashar) @Paladox `dpkg -i` does not resolve dependencies so you then have to manually fix them. Now that the package has bee... [10:13:34] godog: all set thank you very much. [10:14:02] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2597765 (10Paladox) Oh ok, thanks :) [10:15:04] (03PS2) 10Elukey: Add analytics-admins/roots to the Druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/307715 [10:15:09] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [10:15:37] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [10:15:58] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [10:16:15] hashar: np! 
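The manual fix-up hashar describes to Paladox looks like this in practice; a sketch, with the .deb filename illustrative:

```
# dpkg -i installs the file but leaves any unmet dependencies unresolved:
sudo dpkg -i jenkins-debian-glue_0.17.0_all.deb
# apt then resolves and installs whatever dpkg left unsatisfied:
sudo apt-get -f install
```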
[10:16:28] looks like telia might be back [10:17:47] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/3901/" [puppet] - 10https://gerrit.wikimedia.org/r/307715 (owner: 10Elukey) [10:20:12] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [10:22:37] (03PS1) 10Filippo Giunchedi: hieradata: send logs to syslog.codfw.wmnet too [puppet] - 10https://gerrit.wikimedia.org/r/307717 (https://phabricator.wikimedia.org/T138073) [10:27:09] (03PS1) 10Elukey: Revert "Add analytics-admins/roots to the Druid hosts" [puppet] - 10https://gerrit.wikimedia.org/r/307718 [10:27:17] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/3902/" [puppet] - 10https://gerrit.wikimedia.org/r/307717 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [10:28:47] (03PS3) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) [10:28:57] (03CR) 10Elukey: [C: 032] Revert "Add analytics-admins/roots to the Druid hosts" [puppet] - 10https://gerrit.wikimedia.org/r/307718 (owner: 10Elukey) [10:29:48] (03PS4) 10Muehlenhoff: keyholder-proxy/agent: Convert to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/307510 (https://phabricator.wikimedia.org/T144043) [10:32:59] !log starting a gradual deployment of wikidiff2 across the mw* fleet. T140443 [10:33:00] T140443: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443 [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:35] PROBLEM - Apache HTTP on mw2226 is CRITICAL: Connection refused [10:38:34] PROBLEM - Apache HTTP on mw2219 is CRITICAL: Connection refused [10:38:34] PROBLEM - Apache HTTP on mw2223 is CRITICAL: Connection refused [10:39:14] PROBLEM - Apache HTTP on mw2232 is CRITICAL: Connection refused [10:41:13] we are not reimaging these ones afaik [10:41:22] so it could be wikidiff2 related? [10:43:54] PROBLEM - Apache HTTP on mw2217 is CRITICAL: Connection refused [10:44:35] from the logs it seems that systemd stopped apache [10:45:13] PROBLEM - Apache HTTP on mw2220 is CRITICAL: Connection refused [10:45:29] akosiaris: --^ [10:46:19] elukey: possibly [10:47:15] PROBLEM - Apache HTTP on mw2230 is CRITICAL: Connection refused [10:47:18] looking [10:47:24] weird [10:47:35] yeah I don't see any clear indication on 2226 [10:47:37] checking another host [10:47:42] the package has been installed [10:47:52] but nothing relevant from the err logs [10:48:00] or maybe I am missing something super clear [10:48:34] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 1 failures [10:48:38] apache is not running on these hosts [10:48:44] RECOVERY - Apache HTTP on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.128 second response time [10:48:44] PROBLEM - Apache HTTP on mw2221 is CRITICAL: Connection refused [10:49:00] which makes sense since it is not installed [10:49:02] yeah but I don't see a clear reason why except systemd shutting it down [10:49:25] ?? [10:50:01] the package is in rc state [10:50:17] and definitely NOT by me [10:50:29] at least not directly... still looking [10:50:51] according to apt/history.log, it got deinstalled when hhvm-wikidiff2 was installed [10:51:02] wat ?
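(A quick triage sketch for what was just observed — "rc" in dpkg output means the package was removed but its conffiles remain, and apt's history log records which transaction pulled it out:)

```
# show apache2's dpkg state on an affected host ("rc" = removed, conffiles kept)
dpkg -l apache2

# find the transaction that removed it, across plain and rotated logs
zgrep -h 'Remove:.*apache2' /var/log/apt/history.log*
```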
[10:52:15] no idea yet, I'll try to manually reinstall on mw2220 to see whether apt complains [10:52:23] Upgrade: hhvm-wikidiff2:amd64 (1.3.5~jessie1, 1.4.1) [10:52:24] Remove: apache2-utils:amd64 (2.4.10-10+deb8u4+wmf1), apache2-data:amd64 (2.4.10-10+deb8u4+wmf1), apache2:amd64 (2.4.10-10+deb8u4+wmf1) [10:52:28] :O [10:52:45] 06Operations, 10ops-eqiad, 06Discovery, 10Elasticsearch, and 2 others: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685#2597813 (10Cmjohnson) @gehel confirmed and racktables updated [10:52:45] wait, those are jessie's ? [10:52:59] ah so these are the new hosts [10:53:01] yes because there is apache 2.4.10 [10:53:16] mmm but 2.4.10-10+deb8u4+wmf1 is not the latest version [10:53:30] 2.4.10-10+deb8u6+wmf2 is [10:53:35] I didn't upgrade these ones [10:53:49] so maybe hhvm-wikidiff2 wants 2.4.10-10+deb8u6+wmf2 ? [10:53:54] RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.791 second response time [10:54:11] hmm puppet ran and fixed the issue [10:54:34] installing 2.4.10-10+deb8u6+wmf2 [10:54:50] remind me why we have our own apache package [10:55:08] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2228.codfw.wmnet because of too many down! [10:55:09] because of a lot of mod-proxy-fcgi patches :) [10:55:13] PROBLEM - Apache HTTP on mw2224 is CRITICAL: Connection refused [10:55:21] this one is totally my fault [10:55:27] I should have upgraded codfw [10:55:30] I completely missed it [10:55:33] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.781 second response time [10:55:45] sorry akosiaris [10:55:54] RECOVERY - Apache HTTP on mw2226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.134 second response time [10:56:08] not sure it's your fault alone... probably mine as well.. should have tested on a jessie host as well [10:56:25] https://servermon.wikimedia.org/packages/ doesn't show hhvm-wikidiff2, is there a jessie host which doesn't have the new version for debugging this? [10:56:45] RECOVERY - Apache HTTP on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.727 second response time [10:56:54] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2098.codfw.wmnet because of too many down! [10:56:55] akosiaris: did you also upgrade the api servers? I didn't install apache2 in there, only on the appservers [10:57:07] elukey: yes the api servers as well [10:57:41] in codfw only or everywhere? [10:57:43] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [10:57:47] everywhere [10:57:54] PROBLEM - Apache HTTP on mw2216 is CRITICAL: Connection refused [10:58:22] let's check in api eqiad if we have apache2 in rc state [10:59:08] it's not a problem during fresh installs, version 1.4.1 of hhvm-wikidiff2 was already on apt.wikimedia.org, so the reimages I did today already installed the new version without deinstalling apache [10:59:28] also not seeing any package relationship which could have resulted in this [10:59:54] akosiaris: do you have the salt command used for the update still in your shell history?
[11:00:10] it's still running [11:00:26] it was a staggered -b5% which is what probably saved us from an outage [11:00:46] sudo salt -b5% 'mw*' --out raw cmd.run 'dpkg -l | awk "/hhvm-wikidiff2/ {print \"aptitude install -y hhvm-wikidiff2; service hhvm stop ; sleep 5 ; service hhvm start ; sleep 5 \" }" | sh ' [11:00:54] PROBLEM - Apache HTTP on mw2218 is CRITICAL: Connection refused [11:01:24] it's clearly only happening in codfw [11:02:07] akosiaris: I checked mw1276 (that should be a jessie api) and afaics the pkg has not been installed [11:02:22] maybe your command didn't reach any eqiad jessie yet? [11:02:24] PROBLEM - Apache HTTP on mw2215 is CRITICAL: Connection refused [11:02:28] RECOVERY - Apache HTTP on mw2232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.594 second response time [11:02:29] a bit weird I know [11:02:51] but lexicographically the trustys come first [11:03:06] Executing run on ['mw1304.eqiad.wmnet', 'mw1190.eqiad.wmnet', 'mw2189.codfw.wmnet', 'mw2212.codfw.wmnet', 'mw2241.codfw.wmnet', 'mw2065.codfw.wmnet', 'mw2243.codfw.wmnet', 'mw2197.codfw.wmnet', 'mw2227.codfw.wmnet', 'mw2205.codfw.wmnet', 'mw1171.eqiad.wmnet', 'mw2082.codfw.wmnet', 'mw2234.codfw.wmnet', 'mw2062.codfw.wmnet'] [11:03:13] it's quite random as you see [11:03:13] RECOVERY - Apache HTTP on mw2216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.147 second response time [11:03:20] that's the current run [11:03:27] just started [11:03:33] PROBLEM - Apache HTTP on mw2222 is CRITICAL: Connection refused [11:03:38] ah sorry nevermind [11:03:42] :) [11:03:45] PROBLEM - Apache HTTP on mw2229 is CRITICAL: Connection refused [11:03:52] ok so now I am really confused [11:04:54] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [11:04:54] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.813 second response time [11:04:55] (03PS2) 10Filippo Giunchedi: swift: allow disabling account stats [puppet] - 10https://gerrit.wikimedia.org/r/305294 [11:05:31] ok it just finished [11:05:35] RECOVERY - Apache HTTP on mw2230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.960 second response time [11:05:42] so anything matched by salt should have been upgraded [11:06:02] but eqiad is totally fine and codfw is the only one showing the issue [11:06:07] (03PS2) 10Gehel: elastic104[4567] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307700 (https://phabricator.wikimedia.org/T143685) [11:06:33] PROBLEM - Apache HTTP on mw2227 is CRITICAL: Connection refused [11:07:13] mw1300 for example is jessie and has 1.4.1 (has been upgraded) and apache is running fine [11:07:38] (03CR) 10Gehel: [C: 032] elastic104[4567] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307700 (https://phabricator.wikimedia.org/T143685) (owner: 10Gehel) [11:08:12] and it does have 2.4.10-10+deb8u4+wmf1 and NOT 2.4.10-10+deb8u6+wmf2 installed [11:08:17] what on earth ? [11:08:20] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2241.codfw.wmnet because of too many down!
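(The one-liner above, restated with comments for readability — same logic, verbatim from the log; -b5% batches 5% of the matched minions at a time, which is what limited the blast radius here:)

```
sudo salt -b5% 'mw*' --out raw cmd.run \
  'dpkg -l | awk "/hhvm-wikidiff2/ {print \"aptitude install -y hhvm-wikidiff2; service hhvm stop ; sleep 5 ; service hhvm start ; sleep 5 \" }" | sh '
# i.e. only hosts that already have hhvm-wikidiff2 installed emit (and then
# run via sh) the upgrade-and-restart snippet; every other host is a no-op
```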
[11:08:23] RECOVERY - Apache HTTP on mw2224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.140 second response time [11:08:43] RECOVERY - Apache HTTP on mw2222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.139 second response time [11:08:43] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.503 second response time [11:09:09] maybe the apache2 version is a red herring [11:09:17] (03CR) 10Filippo Giunchedi: [C: 032] swift: allow disabling account stats [puppet] - 10https://gerrit.wikimedia.org/r/305294 (owner: 10Filippo Giunchedi) [11:09:20] and $something else was different [11:09:22] (03PS3) 10Filippo Giunchedi: swift: allow disabling account stats [puppet] - 10https://gerrit.wikimedia.org/r/305294 [11:09:46] as moritzm said there should be no relationship between the two packages [11:09:55] (03CR) 10Filippo Giunchedi: [V: 032] swift: allow disabling account stats [puppet] - 10https://gerrit.wikimedia.org/r/305294 (owner: 10Filippo Giunchedi) [11:10:03] !log restarting elasticsearch104[456] to take new rack configuration into account - T143685 [11:10:05] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [11:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:13] at least nothing obvious, also apt didn't complain when I manually reinstalled on mw2220 [11:10:31] (03PS6) 10Filippo Giunchedi: hieradata: add thumbor swift account [puppet] - 10https://gerrit.wikimedia.org/r/305275 (https://phabricator.wikimedia.org/T139606) [11:10:57] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [11:11:50] I just checked apt/history.log on mw2082 and on that host the installation of hhvm-wikidiff2 removed other packages: iucode-tools, libicu48, libapt-pkg-perl [11:12:07] it seems to me as if aptitude runs auto-removal in this case? [11:12:25] the equivalent of "apt-get autoremove" (or maybe the cmd itself) [11:12:34] and that would explain the errors on some hosts [11:12:49] maybe the package dependencies failed to declare some apache dependency properly [11:13:01] and passing "-y" to aptitude caused some fallout [11:13:13] that it removes iucode-tools, libicu48, libapt-pkg-perl makes sense: [11:13:26] libicu48 is replaced by libicu52 (this was some bigger task) [11:13:42] libapt-pkg-perl is also only installed indirectly by d-i and unused later on [11:14:14] RECOVERY - Apache HTTP on mw2229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.447 second response time [11:14:23] RECOVERY - Apache HTTP on mw2227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.593 second response time [11:14:44] and iucode-tools was used in the past for updating Intel CPU microcode, but at some point we refrained from using that [11:14:44] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:14:55] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.575 second response time [11:15:48] moritzm: any theory about the codfw vs eqiad differences?
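(If the auto-removal theory above is right, a dry run would have shown it in advance; a sketch — the -s flags simulate, so nothing is actually changed:)

```
# preview what aptitude would install AND remove alongside the upgrade
aptitude -s -y install hhvm-wikidiff2

# apt's own view of what it considers "no longer needed"
apt-get -s autoremove
```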
[11:15:56] moritzm: I've run this before salting and vetted those packages one by one [11:16:02] all looked innocuous enough [11:16:12] even ran it on multiple hosts just to be certain [11:16:22] all of them in eqiad however [11:16:22] debugging further, one really strange thing: [11:18:05] note that only the hosts in mw2216-mw2230 reported issues [11:18:10] range* [11:18:18] definitely not many hosts... [11:18:37] mw2215- actually [11:19:04] mw2215- are API servers, might be a total red herring, though [11:24:12] but mw2229 is an api one [11:24:15] mmmm [11:24:20] sorry, appserver one [11:25:05] maybe the new app/api servers got a weird pkg config during the last round of installs ? [11:25:31] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add thumbor swift account [puppet] - 10https://gerrit.wikimedia.org/r/305275 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [11:25:33] I don't think so, then it would also have happened during the reimagings [11:28:00] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2597897 (10Gehel) Ideally, I'd like to have the data directory entirely configurable from puppet. At the mo... [11:28:15] 21 hosts in total says neon [11:28:32] and some of these were actually boxes being reimaged at the time [11:28:45] like mw2111 [11:29:19] only mw2215-mw2232 were affected according to my salt run on apt/history.log [11:29:34] ok, thanks for confirming [11:29:45] so, wat ? [11:29:51] what happened ... [11:29:53] * akosiaris stumped [11:29:59] I'll make a detailed comparison between mw2214 and mw2215 [11:30:43] but first some lunch [11:34:11] akosiaris: I'll schedule eqiad api apache upgrades over the next few days if you are ok [11:34:24] will double check all codfw [11:34:31] and then upgrade eqiad apis [11:34:48] ok [11:39:38] !log swift eqiad-prod: set weight for ms-be1021 sd[h-n] to 0 - T139767 [11:39:39] T139767: ms-be1021.eqiad.wmnet: slot=1I:1:2 dev=sdh failed - https://phabricator.wikimedia.org/T139767 [11:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:14] it must be a state issue or something [11:40:18] can't explain it otherwise [11:41:17] (03PS1) 10Gehel: elastic102[1289] moved to new racks [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) [11:42:03] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597935 (10AlexMonk-WMF) >>! In T143718#2597272, @ph_singer wrote: > @Cmjohnson Thanks a lot! Unfortunately, we have some troubles accessin... [11:42:43] (03CR) 10Gehel: "As rack / row configuration isn't used at the moment, this can be merged before the servers are physically moved." [puppet] - 10https://gerrit.wikimedia.org/r/307733 (https://phabricator.wikimedia.org/T143685) (owner: 10Gehel) [11:47:50] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2597938 (10mark) @hashar and I just had a long chat on IRC, where we clarified some things in both directions. A f...
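(The detailed comparison mentioned above could be as simple as diffing the two hosts' package manifests; a sketch, with the hostnames taken from the log:)

```
# dump each host's installed/removed package list, then diff them
for h in mw2214 mw2215; do
  ssh "$h.codfw.wmnet" dpkg -l > "/tmp/$h.pkgs"
done
# show only the package lines that differ between the two hosts
diff -u /tmp/mw2214.pkgs /tmp/mw2215.pkgs | grep -E '^[+-](ii|rc)'
```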
[11:53:25] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2597942 (10akosiaris) 05Open>03Resolved a:03akosiaris The package has been upgraded throughout the mediawiki fleet as well as some other hosts like snap... [12:00:04] (03CR) 10Alex Monk: "Shouldn't you send the identify command in on_welcome, then deal with the result in on_privnotice?" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [12:01:06] (03CR) 10Alex Monk: "Because if you get nickinuse, it'll choose a different name which won't cause the "This nickname is registered. Please choose a different " [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [12:06:00] (03CR) 10Glaisher: "Instead of authing to NickServ, you could also use SASL which would ensure that it's identified before anything else happens." [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [12:09:38] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [12:10:26] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [12:10:30] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597957 (10ph_singer) Here you go: ``` Host * SendEnv LANG LC_* HashKnownHosts yes GSSAPIAuthentication yes GSSAPIDelegateCredentials no... [12:11:00] 06Operations, 10Traffic: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2597958 (10BBlack) >>! In T144187#2597553, @Danielsberger wrote: > frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonder... [12:11:48] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597959 (10ph_singer) Maybe it has something to do with the DNS? [12:15:36] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:16:17] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597962 (10AlexMonk-WMF) When you read and signed L3, you should have noticed the suggested SSH configuration: ```ForwardAgent no Host !ba... [12:21:47] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2597965 (10MoritzMuehlenhoff) php-wikidiff2 can be safely removed on all jessie hosts (and mostly probably also on trusty, but these are being reimaged anyway... [12:22:37] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:23:28] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [12:27:16] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
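(When the NRPE wrapper times out like this, the RAID controller can also be queried by hand; the checks wrap HP's CLI, and this invocation is an assumption based on the slot number in the check output:)

```
# ask the Smart Array controller in slot 3 for per-disk status directly
hpssacli controller slot=3 physicaldrive all show status
```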
[12:31:26] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:34:21] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2597966 (10ph_singer) Thanks, indeed forgot to add that. Works now! [12:35:18] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:38:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 67 probes of 393 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [12:44:04] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2597985 (10jcrespo) [12:50:25] jouncebot: next [12:50:25] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T1300) [12:51:04] no patches scheduled for SWAT [12:51:31] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [12:54:07] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [12:55:13] hashar: no swat today? nothing to deploy [12:55:31] yeah it is all quiet on the wikimedia front [12:57:27] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T1300). Please do the needful. [13:00:52] no patches scheduled. Nothing to do. [13:01:15] akosiaris, elukey: I couldn't really nail this, all the affected servers have been installed on the same day, but there were others installed on the same day which were not affected, diffing package history between an affected and an unaffected host is also identical [13:02:46] I'm considering a puppet patch to remove aptitude cluster-wide, though. it's full of surprising heuristics, development has stalled compared to apt and debdeploy also uses apt. whatever this was, I'm pretty sure it would not have happened with apt [13:04:24] * akosiaris likes aptitude :-( [13:04:43] +1 [13:05:37] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:07:05] It's mostly a suggestion, I'll add all Ops as reviewers to voice objections or approval :-) [13:07:58] 06Operations, 10Traffic: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2598030 (10BBlack) Tying the X-Cache idea together with the tuneable hitcount and size from the paper, we could look at something of the form `if (hc / (os * X) < Y) { uncacheable; }`, where `hc` is...
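(The comment above is cut off mid-definition; per the surrounding T144187 discussion, `hc` is presumably the hit count seen so far and `os` the object size, with X and Y as tunables. A worked sketch of the predicate — all values made up:)

```
# one-hit-wonder admission test: a small object with even one hit passes,
# while a large, rarely-hit object is deemed uncacheable
awk -v hc=1 -v os=1048576 -v X=0.000001 -v Y=2 'BEGIN {
  if (hc / (os * X) < Y) print "uncacheable"; else print "cacheable"
}'
# here: 1 / (1048576 * 0.000001) = 0.95 < 2, so a 1 MB one-hit object is skipped
```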
[13:09:16] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.95 seconds [13:09:57] (03PS12) 10BBlack: nginx (1.11.3-1+wmf2) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/303170 [13:10:40] (03PS1) 10Filippo Giunchedi: prometheus: add node_exporter to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/307741 [13:10:57] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:11:40] moritzm: chase has aptitude in its shell aliases, I heard of a few using it [13:12:27] alias aptitude=apt [13:12:31] (03CR) 10Dereckson: [C: 04-1] "Blocked on ops, need an access to swift from silver for that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [13:12:36] FIFY [13:12:56] ohh [13:13:08] !log reimaging mw2212-mw2215 to jessie [13:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:20] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add node_exporter to syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/307741 (owner: 10Filippo Giunchedi) [13:13:24] (03PS1) 10Ema: upload VCL: do not cache objects with CL:0 and status 200 [puppet] - 10https://gerrit.wikimedia.org/r/307742 (https://phabricator.wikimedia.org/T144257) [13:13:28] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:16:29] (03CR) 10Dereckson: "https://gerrit.wikimedia.org/r/#/c/268625/ has been merged so yes we can proceed with this one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [13:18:58] (03CR) 10BBlack: [C: 031] upload VCL: do not cache objects with CL:0 and status 200 [puppet] - 10https://gerrit.wikimedia.org/r/307742 (https://phabricator.wikimedia.org/T144257) (owner: 10Ema) [13:20:01] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: send logs to syslog.codfw.wmnet too [puppet] - 10https://gerrit.wikimedia.org/r/307717 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [13:20:06] (03PS2) 10Filippo Giunchedi: hieradata: send logs to syslog.codfw.wmnet too [puppet] - 10https://gerrit.wikimedia.org/r/307717 (https://phabricator.wikimedia.org/T138073) [13:21:26] (03CR) 10Filippo Giunchedi: "merged, though this fails to build from source for me in jessie:" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/307292 (owner: 10Gilles) [13:23:57] (03CR) 10Gilles: "I added librsvg2-bin to the runtime dependencies but forgot to add it to the build dependencies... 
Should I make a new changeset that fixe" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/307292 (owner: 10Gilles) [13:24:17] (03PS2) 10Ema: upload VCL: do not cache objects with CL:0 and status 200 [puppet] - 10https://gerrit.wikimedia.org/r/307742 (https://phabricator.wikimedia.org/T144257) [13:24:27] (03CR) 10Ema: [C: 032 V: 032] upload VCL: do not cache objects with CL:0 and status 200 [puppet] - 10https://gerrit.wikimedia.org/r/307742 (https://phabricator.wikimedia.org/T144257) (owner: 10Ema) [13:25:32] !log installing libarchive security updates on jessie [13:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:42] gilles: just the control file is fine [13:26:54] alright, I'll do that [13:27:37] (03PS2) 10Dereckson: Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) [13:28:21] (03PS1) 10Gilles: Add missing build dep [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/307748 [13:30:02] Amir1: zeljkof: hi, do you want to check this one? ^ it's a tricky case to deploy, as it offers to update InitialiseSettings and CommonSettings. We remove a value in IS that's still needed by CS. But if we deploy CS first, the new value isn't there yet. [13:30:12] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.49 seconds [13:31:28] (03CR) 10Filippo Giunchedi: [C: 032] Add missing build dep [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/307748 (owner: 10Gilles) [13:31:29] Dereckson: I'm no expert but what if you use sync-dir [13:31:34] gilles: thanks! [13:31:38] deploy them at once [13:31:39] sync dir isn't atomic [13:31:49] Amir1: zeljkof: in both cases, if we deploy them "at the same time", for example with a scap sync-dir, we'll serve a few notices which will flood the logs for two hours. [13:32:18] Amir1: rsync will copy one, then the other, but meanwhile servers will be hit [13:32:36] Solutions are: 1. do an intermediary patch to avoid removing needed info, or keep using the old one [13:33:02] (03PS1) 10BBlack: Revert "ulsfo: Drain it, there are issues" [dns] - 10https://gerrit.wikimedia.org/r/307750 [13:33:04] (03PS1) 10BBlack: Revert "depool upload in ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/307751 [13:33:24] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:33:37] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:33:58] Dereckson: is there solution #2? :) [13:34:05] yes: or 2. deploy a first file with the removed information manually added back, ie edit manually on tin before deployment, then sync dir, then, once CS is okay and wants the new version, you can resync the genuine IS version from the change. [13:34:36] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 2 failures [13:34:36] Dereckson: uh, not sure what to do, hashar what do you think? [13:34:37] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium): Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2598087 (10chasemp) I am good with following in the footsteps of gallium here as pragmatic. It seems like the mos...
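(A sketch of option 1 just described — ship an intermediary CommonSettings change first so nothing still reads the value, then sync the InitialiseSettings removal on its own; the exact sync-file invocation of the era and the commit messages are illustrative assumptions:)

```
# step 1: CS stops depending on the setting about to be removed
scap sync-file wmf-config/CommonSettings.php 'CategoryTree: stop reading old IS value (T119117)'

# step 2: with nothing reading it, the IS removal is now safe to sync alone
scap sync-file wmf-config/InitialiseSettings.php 'CategoryTree: drop now-unused value (T119117)'
```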
[13:35:08] both are okay, and both are hacky, what we need is an atomic sync dir, it's a question of what you want to spam: gerrit/server log [13:35:13] Dereckson: I would default to how thcipriani|afk does deploys :) [13:35:29] o/ [13:35:47] (03CR) 10BBlack: [C: 032] Revert "ulsfo: Drain it, there are issues" [dns] - 10https://gerrit.wikimedia.org/r/307750 (owner: 10BBlack) [13:35:56] (03CR) 10BBlack: [C: 032] Revert "depool upload in ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/307751 (owner: 10BBlack) [13:36:06] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:36:11] ah extension registration [13:36:15] !log repooling ulsfo traffic [13:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:23] Dereckson: I do not see it in the calendar https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0August.C2.A031 [13:37:04] so, we are talking about https://gerrit.wikimedia.org/r/#/c/268627/ [13:37:06] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 2 failures [13:37:15] I can add it if you wish to deploy it now [13:37:24] Dereckson: zeljkof yeah potentially sync-dir but would prefer ostriches / thcipriani|afk to confirm how to properly sync both common and initialise settings [13:37:25] FWIW, I usually try to have the person who proposed the patch make an intermediary patch that doesn't cause the problem [13:37:26] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:37:31] if hashar thinks we should deploy it, please do [13:37:38] Dereckson: ^ [13:37:39] ideally should be split [13:37:43] with a first patch that is safe [13:37:49] then a second that is safe as well [13:38:00] * Dereckson nods. [13:38:03] not sure whether that case is covered in the deployment doc [13:38:06] would be nice to have if not [13:38:17] i myself am not going to swat it. I am too rusty on that front [13:38:31] Dereckson: I could have sworn you made a phab ticket for atomic sync-dir somewhere, but I couldn't find it when I went looking the other day [13:38:39] I also prefer the dual patches method.
But Reed.y has an interesting argument: that method spams the history log with our deployment kludges [13:39:05] (the repo history log) [13:39:20] thcipriani|afk: https://phabricator.wikimedia.org/T141913 [13:39:44] thanks [13:40:07] !log restart rabbitmq on labcontrol1001 [13:40:09] ah, closed as dupe, that's why I couldn't find it :) [13:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:57] * Dereckson prepares a safer patch [13:42:00] (03PS3) 10Dereckson: Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) [13:44:54] zeljkof: okay, I've added it to the calendar, and I'm on https://hangouts.google.com/hangouts/_/wikimedia.org/euswat [13:45:50] On fluorine: tail: cannot open `/a/mw-log/hhvm.log' for reading: No such file or directory [13:46:13] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:46:44] 06Operations, 10ops-codfw: Broken disk on wtp2016 - https://phabricator.wikimedia.org/T144260#2598102 (10MoritzMuehlenhoff) The broken disk is sda1; root@wtp2016:~# zgrep "Disk failure" /var/log/kern.log* /var/log/kern.log.1:Aug 24 00:27:41 wtp2016 kernel: [1331271.155647] md/raid1:md0: Disk failure on sda1,... [13:47:04] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.06 seconds [13:47:08] 06Operations: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389#2598103 (10Dereckson) [13:47:35] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 2 failures [13:48:29] Dereckson: ok, joining in a minute [13:49:07] hashar: should I swat https://gerrit.wikimedia.org/r/#/c/268627/ with Dereckson? want to join us in the hangout? [13:49:22] moritzm, akosiaris - can I upgrade apache on the codfw jessie appservers and apis? Just double checking [13:49:35] (the ones not running the latest version) [13:51:17] elukey: ok from me [13:51:20] elukey: should be fine if you batch it [13:51:34] thcipriani|afk: any advice on https://gerrit.wikimedia.org/r/#/c/268627/ ? should we deploy it? [13:51:47] moritzm: sure [13:51:55] zeljkof: I am not confident doing it [13:52:10] hashar: should we leave it to thcipriani|afk ? ;) [13:52:18] !log upgrading the codfw jessie appservers to 2.4.10-10+deb8u6+wmf2 [13:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:27] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 2 failures [13:52:30] zeljkof: yeah will be better to have it handled by a real pro :} [13:52:42] hashar: ok, thanks [13:53:06] Dereckson: could you please add the patch to the US morning SWAT, so thcipriani|afk can deploy it? [13:53:07] Personally, I'm confident doing it. We did similar changes in the past, and with the variables array copied, that will be fine. [13:53:16] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 2 failures [13:53:49] Dereckson: if hashar is nervous about it, imagine me :D [13:53:50] Dereckson's kung fu is likely stronger than my own when it comes to SWAT stuff :) [13:54:33] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 393 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:55:00] ok, what is the decision, Dereckson is deploying himself?
I am doing the deploy with him helping? thcipriani|afk is doing the deploy later today? something else? [13:55:37] RECOVERY - swift-account-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:55:39] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures [13:55:54] please keep in mind that there are 5 minutes left in the swat window, too [13:56:00] we could extend it [13:56:20] (03PS1) 10Ema: upload VCL: fix VCL compilation error [puppet] - 10https://gerrit.wikimedia.org/r/307754 [13:57:01] zeljkof: 5-9 minutes are fine, but we've another issue [13:57:22] so I vote for deferring it [13:57:25] (03CR) 10BBlack: [C: 031] upload VCL: fix VCL compilation error [puppet] - 10https://gerrit.wikimedia.org/r/307754 (owner: 10Ema) [13:57:48] (03CR) 10Ema: [C: 032 V: 032] upload VCL: fix VCL compilation error [puppet] - 10https://gerrit.wikimedia.org/r/307754 (owner: 10Ema) [13:57:49] zeljkof: thcipriani|afk: hashar: fatalmonitor isn't available right now, /a/mw-log/hhvm.log doesn't currently exist on fluorine [13:57:52] Dereckson: ok, in that case, please add it to another deployment window [13:58:24] any idea what to do when the log collector doesn't collect? [13:58:26] !log running alter table on db1047 enwiki.categorylinks to mitigate lag issues [13:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:01] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.3-1+wmf2) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.3) - 10https://gerrit.wikimedia.org/r/303170 (owner: 10BBlack) [13:59:42] !log uploaded nginx 1.11.3-1+wmf2 for jessie-wikimedia to carbon (ssl curve variable support) [13:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:11] https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor still working it seems [14:00:56] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:16] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:06:04] (03CR) 10Alex Monk: "Not for Aaron's filesystem-based suggestion, Dereckson" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [14:07:47] (03PS1) 10Andrew Bogott: Switch primary nova-network and nova-api server over to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/307755 [14:09:17] (03CR) 10Dereckson: "When we discussed the issue yesterday, Hashar didn't like the idea of storing files directly on the local filesystem on silver.
A" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307494 (https://phabricator.wikimedia.org/T126338) (owner: 10Dereckson) [14:11:56] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:56] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:30] 06Operations: Setup basic infrastructure services in codfw - https://phabricator.wikimedia.org/T84350#2598210 (10fgiunchedi) [14:13:32] 06Operations, 13Patch-For-Review: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2598208 (10fgiunchedi) 05Open>03Resolved wezen is in service and receiving traffic, resolving [14:13:53] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598217 (10Dzahn) @SindyM3 No problem. I don't know who is in charge of that, but WMNL people would know. When looking at https://www.wikimedia.nl/blog i se... [14:18:37] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:18:59] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:31] !log package updates for cp* (new kernel, new nginx, misc updates) [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:35] mobrovac: ready when you are [14:23:29] godog: i am [14:25:06] !log stopping puppet on wtp* prior to merging https://gerrit.wikimedia.org/r/#/c/304470/ [14:25:09] ok! [14:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:55] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598286 (10SindyM3) @Dzahn WMNL people don't know because I am one of them :( [14:25:55] (03PS1) 10BBlack: varnishxcps: map EC= to ssl_curve stat [puppet] - 10https://gerrit.wikimedia.org/r/307758 [14:26:22] (03PS3) 10Filippo Giunchedi: Parsoid: Switch to Scap3 deployments [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [14:27:07] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures [14:27:13] (03CR) 10BBlack: [C: 032] varnishxcps: map EC= to ssl_curve stat [puppet] - 10https://gerrit.wikimedia.org/r/307758 (owner: 10BBlack) [14:29:05] (03CR) 10Filippo Giunchedi: [C: 032] Parsoid: Switch to Scap3 deployments [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [14:29:10] (03PS4) 10Filippo Giunchedi: Parsoid: Switch to Scap3 deployments [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [14:29:45] (03CR) 10Filippo Giunchedi: [V: 032] Parsoid: Switch to Scap3 deployments [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [14:30:18] mobrovac: I'll run puppet on mira/tin too [14:30:21] godog: let's first run it on tin and then enable and run puppet on wtp2001 [14:30:22] hehe [14:30:57] (03PS1) 10BBlack: tlsproxy: add EC to X-Connection-Properties [puppet] - 10https://gerrit.wikimedia.org/r/307759 [14:34:05] mobrovac: ok running on wtp2001 [14:34:09] k [14:36:20] mobrovac: {{done}}, so far so good, next is deploy to wtp2001 I guess [14:36:33]
ok, lemme do it [14:36:47] i'll manually fix the scap config on tin for that [14:38:36] !log parsoid first scap3 deploy to wtp2001 only [14:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:49] !log upgrading the remaining codfw mw* to 2.4.10-10+deb8u6+wmf2 [14:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:07] ah snap I forgot apache [14:39:19] !log the previous entry was related to apache2 [14:39:22] godog: Finished Deploy: parsoid/deploy (duration: 00m 27s) \o/ [14:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:48] mobrovac: very nice! [14:39:49] godog: so let's enable puppet on all wtp codfw nodes and deploy there [14:40:00] godog: run puppet too, before the deployment :P [14:40:07] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:40:35] will do [14:40:51] !log reenable puppet on wtp* post-merge of scap3 conversion [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:03] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598330 (10siebrand) @sindyM3: it's your server that's being exploited and it's maintained by a third party that you pay to keep you safe. It's probably time t... [14:42:15] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598332 (10MoritzMuehlenhoff) @Papaul: Given that this is affecting seven hosts as of now, this is likely not a hardware error, but probably some kind of configuration setting we'... [14:43:21] Hmm, I've seen that type of injection before [14:44:43] Well, it was Russian SEO spam [14:44:46] https://wordpress.org/support/topic/security-latest-injection-from-russian-seo-spam-hack [14:44:48] haha [14:45:29] (03PS2) 10Andrew Bogott: Switch primary nova-network and nova-api server over to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/307755 (https://phabricator.wikimedia.org/T142567) [14:46:02] Usually it is placed in the footer... [14:47:17] (03PS25) 10BBlack: beta: Use Let's Encrypt cert [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: 10Alex Monk) [14:48:35] mobrovac: all done everywhere [14:48:59] ok, deploying [14:49:00] (03CR) 10BBlack: [C: 032] beta: Use Let's Encrypt cert [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: 10Alex Monk) [14:49:57] !log parsoid first scap3 deploy to wtp20xx [14:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:41] (03CR) 10GWicke: [C: 031] "Minor: The indentation looks off." [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [14:55:33] godog: {{done}}, all good! [14:55:38] yeah baby [14:55:52] godog: ok, i think we are safe to proceed to eqiad [14:55:53] awesome! [14:58:34] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2598350 (10Cmjohnson) New disk has been sent We would like to inform you that your order for case ID# 5311440162 has shipped. The estimated time of arrival is Thursday, September 01,...
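(For reference, the canary-then-fleet flow just logged above would look roughly like this on tin; a sketch only — the checkout path and the limit flag are assumptions about scap3 of this era, not taken from the log:)

```
cd /srv/deployment/parsoid/deploy   # scap3 deploys run from the repo checkout
scap deploy -l wtp2001.codfw.wmnet 'parsoid canary'   # one host first
scap deploy 'parsoid everywhere'                      # then the whole fleet
```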
[14:59:26] (03PS1) 10BBlack: deployment caches: enforce secure POST (etc) [puppet] - 10https://gerrit.wikimedia.org/r/307762 [14:59:50] (03PS1) 10Alex Monk: Follow-up I450316d7: Check acme_subjects length [puppet] - 10https://gerrit.wikimedia.org/r/307763 [14:59:58] bblack, ^ [15:00:05] andrewbogott and chasemp: Dear anthropoid, the time has come. Please deploy Labs network maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T1500). [15:00:17] (03CR) 10BBlack: [C: 032 V: 032] Follow-up I450316d7: Check acme_subjects length [puppet] - 10https://gerrit.wikimedia.org/r/307763 (owner: 10Alex Monk) [15:02:53] (03CR) 10Hashar: "Looks fine to me." [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [15:02:58] (03CR) 10Andrew Bogott: [C: 032] Switch primary nova-network and nova-api server over to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/307755 (https://phabricator.wikimedia.org/T142567) (owner: 10Andrew Bogott) [15:03:05] (03PS3) 10Andrew Bogott: Switch primary nova-network and nova-api server over to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/307755 (https://phabricator.wikimedia.org/T142567) [15:04:27] (03CR) 10Andrew Bogott: [C: 032] Switch primary nova-network and nova-api server over to labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/307755 (https://phabricator.wikimedia.org/T142567) (owner: 10Andrew Bogott) [15:07:07] godog: puppet ran in eqiad? [15:07:22] according to salt it did yeah [15:07:41] k, deploying EVERYWHERE [15:07:45] yupi [15:08:18] (03PS1) 10Alex Monk: Follow-up Idba60e3c: Check !@acme_subjects.empty? instead [puppet] - 10https://gerrit.wikimedia.org/r/307769 [15:08:32] (03PS2) 10Dzahn: partman/cassandra: rename recipes, delete unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/307665 [15:08:34] (03PS1) 10BBlack: Follow-up I450316d7: Ruby hates programmers [puppet] - 10https://gerrit.wikimedia.org/r/307770 [15:08:37] \o/ [15:08:41] heh [15:08:44] I'll just merge yours :) [15:09:02] (03CR) 10BBlack: [C: 032 V: 032] Follow-up Idba60e3c: Check !@acme_subjects.empty? instead [puppet] - 10https://gerrit.wikimedia.org/r/307769 (owner: 10Alex Monk) [15:27:34] bblack: hi! just waiting for some +2 goodness... [15:28:03] AndyRussG: I would, but I have no business reviewing client-side JS and/or extension code :) [15:28:08] !log depooling/powering down mw2148 and mw2088 for hardware check (not mw210[12]) as mentioned before [15:30:15] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Disable cron job to clear elasticsearch caches and validate that it does not have significant impact on GC - https://phabricator.wikimedia.org/T144396#2598460 (10Gehel) [15:30:19] bblack: heh no worries, I wouldn't have anything sane to say about that Varnish magic you conjure... Yeah I think folks know it's urgent but also have been looking at other urgent stuff... 
[15:30:53] I'll ask for some final review again today [15:30:53] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Disable cron job to clear elasticsearch caches and validate that it does not have significant impact on GC - https://phabricator.wikimedia.org/T144396#2598479 (10Gehel) p:05Triage>03Normal [15:30:57] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598480 (10MoritzMuehlenhoff) @Papaul : mw2088 is working fine, mw2148 is not working. Both are powered off and depooled. [15:31:55] bblack: is there anything additional that I can tell the team about the urgency of turning off the lookup service? [15:32:43] (03PS2) 10Alex Monk: Reenable $wgMWOAuthSecureTokenTransfer=true; on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302630 (https://phabricator.wikimedia.org/T67421) (owner: 10Gergő Tisza) [15:32:43] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/307759 (owner: 10BBlack) [15:34:04] AndyRussG: well if the patch ends up idling and nothing gets done, we'd probably eventually turn off the service regardless. We never really set a hard date for that possibility, but I'd guess during next week if it came to it. [15:35:24] (03CR) 10Alex Monk: [C: 032] Reenable $wgMWOAuthSecureTokenTransfer=true; on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302630 (https://phabricator.wikimedia.org/T67421) (owner: 10Gergő Tisza) [15:35:31] most likely we could do it now and it doesn't harm WMF CN usage anyways, since most users have valid cookies, and the few empty-results cookies that would trigger the callback, it probably doesn't affect the client if the (async) callback fails. Waiting for CN release is mostly about ensuring no possibility of async callback failure fallout and giving 3rd parties a way to cope. [15:35:51] (03Merged) 10jenkins-bot: Reenable $wgMWOAuthSecureTokenTransfer=true; on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302630 (https://phabricator.wikimedia.org/T67421) (owner: 10Gergő Tisza) [15:37:26] (03CR) 10BBlack: [C: 032] tlsproxy: add EC to X-Connection-Properties [puppet] - 10https://gerrit.wikimedia.org/r/307759 (owner: 10BBlack) [15:39:19] !log krenair@tin Synchronized wmf-config/CommonSettings-labs.php: no-op in prod, this change is only for deployment-prep in labs (duration: 02m 51s) [15:39:46] apparently mw2088 and mw2148 are down [15:39:57] !log banning elastic102[128] from elasticsearch eqiad cluster to prepare the move - T143685 [15:40:28] first one was being reimaged by moritzm [15:44:44] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598553 (10Papaul) @MoritzMuehlenhoff Thanks will start working on it. [15:46:43] (03PS5) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [15:47:41] (03PS1) 10Filippo Giunchedi: thumbor: use 'mw' as thumbor account [puppet] - 10https://gerrit.wikimedia.org/r/307777 (https://phabricator.wikimedia.org/T139606) [15:49:10] 06Operations, 10Traffic, 10Wiki-Loves-Monuments, 07HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598561 (10Akoopal) @siebrand I will send a mail about this, but imho lemonbit has done the correct steps. [15:50:00] bblack: K thx!! anyway hopefully it'll get +2 today. 
If so deploy might be best tomorrow morning... :) [15:50:45] (03PS2) 10Filippo Giunchedi: thumbor: use 'mw' as thumbor account [puppet] - 10https://gerrit.wikimedia.org/r/307777 (https://phabricator.wikimedia.org/T139606) [15:52:48] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Disable cron job to clear elasticsearch caches and validate that it does not have significant impact on GC - https://phabricator.wikimedia.org/T144396#2598568 (10Gehel) [15:54:45] (03PS1) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) [15:55:23] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: use 'mw' as thumbor account [puppet] - 10https://gerrit.wikimedia.org/r/307777 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [15:57:40] (03PS2) 10Rush: labstore nfs-exports daemon [puppet] - 10https://gerrit.wikimedia.org/r/307629 (https://phabricator.wikimedia.org/T126083) [16:05:11] !log Updated iegreview to 26e3cd1 (Convert ParsoidClient to RESTBase api) [16:05:30] Niharika: ^ want to verify that things still work? [16:05:50] (03PS2) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) [16:06:15] bd808: I still see a bunch of "Cannot POST /enwiki/" [16:06:42] boo. Let me see if we are setting that key in .env [16:06:57] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Disable cron job to clear elasticsearch caches and validate that it does not have significant impact on GC - https://phabricator.wikimedia.org/T144396#2598688 (10Gehel) The elasticsearch search cluster has [[ https://gr... [16:07:13] Niharika: damn. we are [16:07:37] bd808: We need to comment out the parsoid url line in .env, right? [16:07:41] Yeah. [16:07:48] Didn't read your previous message. [16:08:00] yeah. 
it's going to need a patch to ops/puppet [16:08:03] * bd808 makes that patch [16:09:30] (03CR) 10Gehel: "puppet-compiler: https://puppet-compiler.wmflabs.org/3907/" [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel) [16:12:04] (03PS3) 10Urbanecm: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) [16:12:37] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2598701 (10Dereckson) [16:12:40] (03PS4) 10Urbanecm: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) [16:13:23] (03PS1) 10BryanDavis: iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) [16:14:34] (03CR) 10jenkins-bot: [V: 04-1] iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) (owner: 10BryanDavis) [16:15:01] (03PS3) 10Dzahn: partman/cassandra: rename recipes, delete unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/307665 [16:15:12] (03CR) 10Dzahn: [C: 032] partman/cassandra: rename recipes, delete unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/307665 (owner: 10Dzahn) [16:15:29] (03PS2) 10BryanDavis: iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) [16:16:13] mutante: if you've got time, this ^ is bordering on UBN. reviews of current grant round are being blocked by it [16:16:32] (03PS10) 10Madhuvishy: nfs: Modify /data/scratch on nfs clients to point to mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/306019 (https://phabricator.wikimedia.org/T134896) [16:16:38] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086341 (10Dereckson) That would have been interesting to get that, as we've a fluorine failure at T144389. [16:17:20] Woah. One would think a patch to change .env would be smaller than this. :D [16:17:55] bd808: looking in a sec [16:18:15] mutante: my hero :) [16:19:17] Niharika: it could have been but I got annoyed by the code written by my past self [16:19:38] * bd808 is actually forcing himself not to rewrite that whole module [16:19:59] :D It's not so bad. [16:20:02] it's pre-hiera and a bit gross [16:20:15] it could be much much cleaner [16:20:40] I'll save it for the switch from trebuchet to scap3 [16:22:42] bd808: no more "mysql_db" parameter for iegreview? [16:23:11] and "hostname" [16:23:30] mutante: those I took out were exactly the defaults for the class [16:23:35] https://gerrit.wikimedia.org/r/#/c/307781/2/modules/role/manifests/iegreview/app.pp [16:23:42] oh [16:23:47] but I can put them back if you'd like a smaller diff [16:24:27] i wouldnt mind that [16:24:32] k [16:26:09] (03PS3) 10BryanDavis: iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) [16:26:38] learns about the https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html .. 
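(A minimal smoke test of that transform endpoint; it only answers POST — which is why it can't simply be opened in a browser — and the wikitext form field name here is an assumption based on the public REST API docs:)

```
curl -s -X POST 'https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html' \
     --data-urlencode "wikitext='''Hello''', world"
```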
aha [16:27:05] its the restbase wrapper around the parsoid endpoint we used to use [16:27:15] (03CR) 10jenkins-bot: [V: 04-1] iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) (owner: 10BryanDavis) [16:27:16] i cant directly open that in browser like that [16:27:17] calling parsoid directly has been deprecated [16:27:20] but i see https://en.wikipedia.org/api/rest_v1/ [16:27:22] yea [16:27:30] grr lint [16:27:48] bd808: oh bryan (re email, er, email) [16:28:10] (03PS4) 10BryanDavis: iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) [16:28:30] greg-g: getting reviews in core for esoteric things is ... fun [16:29:20] (03PS5) 10Dzahn: iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) (owner: 10BryanDavis) [16:29:27] (03CR) 10Dzahn: [C: 032] iegreview: Switch from Parsoid to RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/307781 (https://phabricator.wikimedia.org/T114186) (owner: 10BryanDavis) [16:30:31] waits for submit button to appear [16:30:57] there it is. merged [16:31:07] bd808: i'll run puppet on krypton [16:32:06] bd808: switched. please check ? [16:32:51] !log iegreview switched from parsoid to RESTBase [16:33:15] mutante: mostly fixed! One more thing while you are on krypton: delete all the files in /var/cache/iegreview please [16:33:39] things that weren't cached are showing fine now, but we cached some broken stuff [16:33:55] bd808: there are files ending in .parsoid and files ending in .restbase, ALL ALL or just .parsoid [16:34:03] all all [16:34:16] the parsoid ones won't get used anymore [16:34:30] and some of the .restbase are junk [16:34:35] Awesome. [16:34:37] Works now. [16:35:20] bd808: uhm.. there are still a bunch of subdirectories [16:35:25] nuke that too? [16:35:34] i took "files" literally [16:35:45] hmm.. subdirs should be ok [16:35:49] ok, then done [16:36:01] mutante: <3 I'll have the grants folks send you a hug [16:36:04] or maybe a beer :) [16:36:14] no problem:) [16:36:41] Does anyone know if there were DB replication issues earlier today? [16:38:34] !log mutante updated puppet role and cleared cache for iegreview (T114186) [16:38:37] jynus: ^ ? [16:38:46] where's the bot? [16:38:55] between 13:00 and 13:40 UTC? [16:39:07] s3 is under maintenance: https://tools.wmflabs.org/replag/ [16:39:27] it will go up and down for a few [16:40:20] hmm, no logging bot indeed [16:40:44] (speaking of logging, fluorine doesn't have hhvm log anymore, so fatalmonitor dead) [16:40:57] jynus: hmmm K could that be responsible for https://phabricator.wikimedia.org/T144393 ? [16:41:45] (i.e., CentralNotice admin interface showing updated campaign configuration for admin user, but not propagating to clients as expected?) [16:42:20] AndyRussG, no, production is the one where the maintenance is ongoing, but a) it has not yet reached user-visible servers and b) central-stuff and meta are on s7, not s3 [16:42:46] I can check s7, though [16:43:11] K thx! [16:44:29] bd808: are you looking into kicking morebots/stashbot already? I'm going to do that otherwise [16:44:42] godog: I'm on it [16:44:52] sweet, thanks [16:45:28] (03PS1) 10Andrew Bogott: Pare down the cloud-init commands on precise [puppet] - 10https://gerrit.wikimedia.org/r/307784 [16:45:31] morebots: hows things? [16:45:31] I am a logbot running on tools-exec-1221.
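
[Editor's note: the rest_v1 transform endpoint discussed above is POST-only, which is why it cannot simply be opened in a browser. A minimal Python sketch of exercising it; the "wikitext" form field follows the public /api/rest_v1/ documentation, but treat the exact parameters as assumptions:]

    import requests

    # POST wikitext to the RESTBase wrapper around Parsoid (direct Parsoid
    # calls are deprecated, per the discussion above).
    resp = requests.post(
        "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html",
        data={"wikitext": "'''Hello''', [[world]]!"},
        headers={"User-Agent": "restbase-transform-sketch/0.1 (example)"},
    )
    resp.raise_for_status()
    print(resp.text)  # rendered HTML fragment
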
[16:45:32] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [16:45:32] To log a message, type !log . [16:45:35] I'm surprised they don't get restarted and/or don't join back [16:45:46] jynus: the way centralnotice works (note: not centralauth, something different) is by directly reading the meta database server-side (calls to load.php that send current campaign configuration to clients as JSON in a ResourceLoader module) [16:47:00] godog: that bot ... needs to be rewritten/retired. I keep thinking that I'll fix stashbot to do all the things it does on-wiki [16:47:18] but I've not gotten to it because low priority [16:48:08] thanks for fixing the bot, meanwhile i pasted that stuff on the ticket [16:48:09] The responses are cahced Varnish-side but should roll over frequently, like all RL stuff [16:48:22] bd808: aye, the irc bot situation is pretty dire in puppet too, I think there's like 4 different implementations committed [16:48:37] ok, i will puppetize eggdrop :p [16:48:47] then we write TCL scripts [16:49:08] godog: yeah... the deb package should die in a fire [16:49:25] haha thanks mutante, make sure to commit the eggdrop code directly in puppet.git [16:49:34] bd808: *nod* but again, low priority -.- [16:49:43] xdcc send latest_mediawiki.tar.gz [16:51:11] godog: then we have _all_ of the powers of http://www.egghelp.org/tcl.htm [16:51:18] godog: as soon as I get stashbot on k8s and detached from NFS then I'll get more excited about making it capable of taking over for several other aging bots [16:51:43] AndyRussG, I do not see anything relevant; however, unless we have a major outage, replication lag should have nothing to do with those errors [16:52:08] when there is lag, a server gets out of the load balancer [16:52:33] and at most, you could get like issues for 10 seconds; not 30 minutes [16:52:35] (03CR) 1020after4: [C: 031] WIP: Remove "p" symlink from WMF config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306471 (owner: 10Chad) [16:52:56] jynus: great, thx...! P.S. can I quote you in the phabricator bug? [16:53:01] AndyRussG, I remember, however, some issues related to Banners and campaings [16:53:07] yes, please do [16:53:11] :) [16:53:14] Which issues? [16:53:15] Let me see if I can find it [16:53:20] K! [16:54:08] bd808: nice! yeah would be really nice to see some gone/updated [16:56:15] (03CR) 10Alex Monk: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/307762 (owner: 10BBlack) [16:56:17] (03PS11) 10Madhuvishy: nfs: Modify /data/scratch on nfs clients to point to mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/306019 (https://phabricator.wikimedia.org/T134896) [16:56:30] (03CR) 10Madhuvishy: [C: 032 V: 032] nfs: Modify /data/scratch on nfs clients to point to mount from labstore1003 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/306019 (https://phabricator.wikimedia.org/T134896) (owner: 10Madhuvishy) [16:56:49] 06Operations, 10Mail, 10Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805#1758884 (10greg) [16:57:26] jynus: The issue can be summarized as, a campaign not behaving the way it expected to, given its configuration in the administration interface. 
Both the server-side code that shows how campaigns are allocated, and the client-side code that allocates them to actual users, were unexpectedly determining that the campaign should not be seen [16:58:05] (03PS1) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [16:58:11] 06Operations, 10Mail, 10Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805#1758884 (10greg) As of https://gmail.googleblog.com/2016/02/making-email-safer-for-you-posted-by.html Google is now identifying Phabricator emails as potentially da... [16:58:49] (03CR) 10Gehel: [C: 04-1] "I'd like to run some tests before merging this in prod." [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [16:59:17] Since I don't know of any recent changes in any of the code that handles any of that, the first place I'm looking is outside of CentralNotice [16:59:51] Could also be some weird edge case we've never had, or some side effect of other CN code changes (like extension registration?) [16:59:59] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598882 (10Papaul) @MoritzMuehlenhoff please see below for the comparison table between the 2 systems. the only thing that i can see that can cause this problem is 1- IDRAC IPMI... [17:05:08] (03Abandoned) 10Kaldari: Adding AssessmentSaveJob (from PageAssessments) to low-priority jobs [puppet] - 10https://gerrit.wikimedia.org/r/305871 (owner: 10Kaldari) [17:05:38] (03CR) 10Andrew Bogott: [C: 032] Pare down the cloud-init commands on precise [puppet] - 10https://gerrit.wikimedia.org/r/307784 (owner: 10Andrew Bogott) [17:05:43] (03PS2) 10Andrew Bogott: Pare down the cloud-init commands on precise [puppet] - 10https://gerrit.wikimedia.org/r/307784 [17:07:18] (03PS1) 10Dereckson: Enable Flow personal talk opt-in Beta Feature on el.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307788 (https://phabricator.wikimedia.org/T144384) [17:07:30] (03CR) 10jenkins-bot: [V: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [17:09:43] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598902 (10MoritzMuehlenhoff) Where is that "IDRAC IPMI Over LAN" setting coming from, in idrac or some other config tool? Can you enable "IDRAC IPMI over LAN", so that I can test... [17:15:48] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2598925 (10Papaul) the left window is mw2088 and the right windows is mw2148. I enable the setting on mw2148. {F4420474} [17:17:29] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2598930 (10bd808) So you want to force rename my Wikitech account because it doesn't match my ~/.gitconfig `user.name` setting? I don't think I'm excited abou... 
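
[Editor's note: AndyRussG's description above (campaign settings read from the database server-side and shipped to clients as JSON inside a ResourceLoader module, with load.php responses cached in Varnish) can be probed like any other RL module. A rough Python sketch; the module name used here is an assumption for illustration:]

    import requests

    # Fetch the ResourceLoader module assumed to carry CentralNotice campaign
    # choice data; a stale Varnish copy of this response would explain admins
    # seeing fresh configuration while clients do not.
    resp = requests.get(
        "https://meta.wikimedia.org/w/load.php",
        params={"modules": "ext.centralNotice.choiceData", "only": "scripts"},
        headers={"User-Agent": "cn-cache-probe-sketch/0.1 (example)"},
    )
    resp.raise_for_status()
    print(resp.text[:400])  # JS wrapper embedding the campaign JSON
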
[17:20:30] (03CR) 10BryanDavis: elasticsearch - disable cron to clear elasticsearch caches (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel) [17:22:09] 06Operations, 10netops: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2598949 (10Jgreen) [17:22:12] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2598950 (10Jgreen) [17:26:31] (03PS3) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) [17:27:09] (03CR) 10Gehel: elasticsearch - disable cron to clear elasticsearch caches (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel) [17:28:27] (03CR) 10Yurik: [C: 04-1] maps - use new tileshell.js script to notify tilerator of expired tiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) (owner: 10Gehel) [17:33:39] (03PS6) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [17:35:23] (03PS7) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [17:35:52] (03PS1) 10Mobrovac: Change-Prop: Ignore non-main NS titles for Wikidata updates [puppet] - 10https://gerrit.wikimedia.org/r/307791 [17:36:31] (03PS8) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [17:37:18] (03PS1) 10Andrew Bogott: Remove one more cloudinit script from Precise [puppet] - 10https://gerrit.wikimedia.org/r/307792 [17:41:20] !log change-prop deploying bf221599c [17:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:57] (03PS4) 10BryanDavis: Wait for ack of identify from NickServ before joining channels [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) [17:44:07] (03CR) 10Andrew Bogott: [C: 032] Remove one more cloudinit script from Precise [puppet] - 10https://gerrit.wikimedia.org/r/307792 (owner: 10Andrew Bogott) [17:45:10] (03CR) 10BryanDavis: "> Because if you get nickinuse, it'll choose a different name which" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [17:50:35] (03CR) 10BryanDavis: [C: 031] elasticsearch - disable cron to clear elasticsearch caches [puppet] - 10https://gerrit.wikimedia.org/r/307779 (https://phabricator.wikimedia.org/T144396) (owner: 10Gehel) [17:50:56] (03PS5) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [17:51:09] (03CR) 10jenkins-bot: [V: 04-1] abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [17:51:35] (03PS6) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [17:51:50] (03CR) 10jenkins-bot: [V: 04-1] abstract out code for 
adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [17:55:56] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2599097 (10dr0ptp4kt) Heads up @Jhernandez, I think this implies there's a different, more can... [17:57:38] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2562534 (10Mattflaschen-WMF) If this affects window.Geo, could you coordinate to avoid breakin... [18:02:10] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2599113 (10BBlack) The cookie itself isn't going away (and is normally set by our servers in t... [18:04:45] (03PS2) 10Gehel: maps - use new tileshell.js script to notify tilerator of expired tiles [puppet] - 10https://gerrit.wikimedia.org/r/307787 (https://phabricator.wikimedia.org/T139451) [18:04:55] looks like i'm the only one on SWAT today, will ship it [18:05:49] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2598701 (10demon) I'm not sure why this is a bug? If Github doesn't recognize it, that's their problem. Also, Gerrit does not use the real name, it uses the C... [18:17:26] !log ebernhardson@tin Synchronized php-1.28.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: Enable CirrusSearch BM25 AB test (duration: 00m 50s) [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:55] 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org, 07LDAP: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2599177 (10demon) 05Open>03Resolved >>! In T133968#2551726, @Sophivorus wrote: > I modified my rename request. The reque... [18:19:37] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2599180 (10Legoktm) Hmm, what about wikitech/silver which is still using PHP5 on trusty? [18:20:03] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2599181 (10Dereckson) And what's the expected value for this common name field, a full name or a login? [18:20:37] (03PS2) 10Dzahn: partman: delete some more unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/306501 [18:22:06] !log ebernhardson@tin Synchronized php-1.28.0-wmf.16/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: Enable CirrusSearch BM25 AB test (duration: 00m 47s) [18:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:28] (03CR) 10Dzahn: "anyone know if "analytics-cisco" can be deleted? the one with "squid" also sounds obvious. what about the others.." 
[puppet] - 10https://gerrit.wikimedia.org/r/306501 (owner: 10Dzahn) [18:25:30] (03CR) 10Dzahn: "should be possible but only after https://phabricator.wikimedia.org/T133548 is resolved" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [18:25:56] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2599223 (10demon) Both since we don't really use displayName in Gerrit. We could've awhile ago, but didn't and it's a bit late to change now. We use CN for b... [18:26:13] (03CR) 10Dzahn: "should be possible after https://phabricator.wikimedia.org/T133548 gets resolved" [puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [18:26:56] so what is adds all the logs to /a/mw-log/hhvm.log? [18:27:01] (03CR) 10Dzahn: "the Apache change is blocked by https://phabricator.wikimedia.org/T133548 but the DNS change would not be? did the actual domain transfer" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [18:27:13] because it seems that that file no longer exists on fluorine [18:28:04] logstash seems to be monitoring and working [18:28:32] but I do tend to watch both during a deployment: logstash as well as fatalmonitor on fluorine [18:28:56] some days I trust one more than the other, so is weird that one is gone missing and the other is fine. [18:30:00] I haven't used flourine in well over a year :) [18:30:34] (03CR) 10Dzahn: "waiting for replies on this thread: https://phabricator.wikimedia.org/T143969#2589903" [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [18:31:06] eh, sometimes faster to see log messages climb up in fatalmonitor. I'm faster at switching term windows than browser tabs :) [18:31:23] 06Operations, 10Gerrit, 07LDAP: Update LDAP real names to be coherent with committer information in Gerrit - https://phabricator.wikimedia.org/T144404#2599241 (10Dereckson) 05Open>03Invalid I'm marking that as invalid, as I thought the schema has two field, one to store the username used as login in all... [18:33:28] fatalmonitor works well in a dropdown terminal too during SWAT [18:34:03] !log RESTBase deploy fa1a5aab to staging [18:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:09] thcipriani: that probably needs a task, not good [18:35:20] greg-g: https://phabricator.wikimedia.org/T144389 [18:35:35] Dereckson filed one already [18:47:14] !log RESTBase deploy fa1a5aab canary on restbase1007 [18:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:38] (03Abandoned) 10Dzahn: extdist: move role to module, fix class name, labs-only [puppet] - 10https://gerrit.wikimedia.org/r/298906 (owner: 10Dzahn) [18:51:19] (03CR) 1020after4: [C: 031] Phab: Remove config abstraction. 
Useless & confusing [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [18:52:40] !log RESTBase deploy fa1a5aab [18:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:53] !log shutting down elasticsearch on elastic1028 to prepare moving server - T143685 [18:54:55] T143685: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685 [18:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:20] so logs are no more written on fluorine.eqiad.wmnet :( T144389 [18:56:20] T144389: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389 [18:56:26] though they are sent to logstash at least [19:02:51] I am blocking the train pending fluorine logs [19:03:13] could use someone with root to look at tthe puppet log on fluorine.eqiad.wmnet to check what it might have broken [19:03:17] udp2log no more write to disk :( [19:03:40] or maybe that is rsylog [19:04:01] it's got to be udp2log that's messed up [19:04:56] pstree looks crazy for that parent proc. 299*[python] + 472*[sh] [19:06:32] yeah [19:06:33] hashar: start screaming for root help [19:06:40] last spawned on Feb 18th [19:06:51] and it is known to have sub process to Zombie once in a while [19:06:58] tell me where to lok [19:07:01] look [19:07:33] fluorine.eqiad.wmnet [19:07:36] in /var/log/puppet.log [19:07:42] hello apergos :) [19:07:50] might just want to restart udp2log I am looking for doc [19:07:56] apergos: it looks like the procs for /etc/init.d/udp2log-mw may have gone crazy. [19:08:12] so I should look for anything weird in the puppet log? [19:08:13] checking... [19:08:41] visible symptom is basically no logs in fluorine:/a/mw-log [19:09:07] so maybe just restart the udp2log-mw process will solve it [19:09:11] demux.py is the bit that outputs those files as far as I recall [19:09:16] it does receive event as far as I can tell [19:09:18] /etc/rsyslog.d/30-remote-syslog.conf was changed to *.info;mail.none;authpriv.none;cron.none @syslog.eqiad.wmnet;LongTagForwardFormat [19:09:49] +*.info;mail.none;authpriv.none;cron.none @syslog.codfw.wmnet;LongTagForwardFormat [19:09:49] this was added later [19:09:59] can confirm it is demux.py that write directly to disk [19:10:29] nothing else in puppet log except host ssh key changes [19:10:33] the hhvm log should be written by rsylog I think [19:10:54] at least we send the data via rsylog forwarding from the MW servers [19:11:01] maybe udp2log with all its zombie process ends up with too many file descriptors ? [19:11:36] or whatever craziness [19:12:49] lemme know if you want me to try a simple restart [19:13:59] apergos: worth a shot. All I can see are a zillion zombie python and sh procs that started from udp2log [19:14:14] yep, a whole load of crap there [19:15:03] I am doing a stop/start rather than a reload/restart to get those zombies gone [19:15:41] !log stopped/started udp2log-mw on fluorine [19:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:54] good bot. nice bot. [19:15:58] * apergos gives morebot a biscuit [19:16:02] at least the Zombies are gone [19:16:58] there should be data immediately. 
something else is up [19:17:07] let's see when the last data in the rotated logs was [19:17:14] hrmph [19:17:32] bd808: https://phabricator.wikimedia.org/T144389#2599373 [19:17:40] that logrotateed this morning at 6:25 as usual [19:17:52] those changes are the ones I reported above in puppet [19:18:48] oh the logrotate. uh huh. there were messages until when, I wonder? [19:20:11] apergos biscuit can mean different things in english lol [19:20:20] I know! [19:20:37] (03CR) 10Alex Monk: [C: 032] "Not sure the do_identify call in on_privnotice is necessary, but it should work" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [19:20:45] I figured it would be the bot's choice between a tea biscuit and a dog biscuit [19:20:53] (03Merged) 10jenkins-bot: Wait for ack of identify from NickServ before joining channels [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [19:21:12] so the last line in the old hhvm.log was timestamped Aug 30 08:21:35 [19:21:18] oh [19:21:24] Aug 30? huh [19:21:29] I forgot to tail the log files [19:21:36] bah wrong file [19:21:43] * bd808 tries again [19:21:47] heh [19:21:49] 06Operations: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389#2599468 (10hashar) @bd808 wrote: the last line in the old hhvm.log was timestamped Aug 30 08:21:35 [19:22:00] I should stop copy pasting to the task :D [19:22:05] apergos Digestive biscuits? [19:22:05] gz and tail aren't friends [19:22:11] * apergos sees an edit to that comment coming any second now... [19:22:13] ok, Aug 31 06:25:13 [19:22:19] thank goodness for that edit comment feature [19:22:26] hm so the log rot and then nada. [19:22:33] that's for hhvm.log [19:23:10] api.log would tell us just about exactly where it broke, but since the only way to get the last line of a gzipped file is to cat the whole thing that would take a while [19:23:14] and apache2.log is Aug 31 06:25:23 [19:24:41] -rw-r--r-- 1 udp2log udp2log 28888114932 Aug 31 06:25 api.log-20160831.gz [19:24:45] this file you mean? [19:25:25] well I'm patient. zcat started :-P [19:25:39] yeah. there are several thousand api.log entries per second, so it would be pretty much exact on when logs stopped working (if they did) [19:26:24] anything change with firewall rules on fluorine? [19:26:30] I'll bet the last entry was just before log rot [19:26:39] I'll bet a tea or doggie biscuit [19:27:55] logrotate sends a HUP via killall to udp2log post rotate [19:28:36] so if the logs were working right up until then... I'm not sure. [19:28:55] restart should have fixed things if it was proc local [19:29:43] * bd808 needs food [19:29:44] git log says nothing about firewall rules in the last couple of days [19:29:58] and /etc/logrotate.d/udp2log-mw got last changed on Aug 16th [19:30:10] but that is 15 days ago [19:30:18] and rot is every day [19:30:32] still waiting for the zcat | tail to finish [19:30:35] what's going on with fluorine? [19:30:42] I edited some of that code in puppet recently [19:30:56] maybe it is a permission issue and it cant write to the dir? [19:31:08] Krenair: no logs to /a/mw-log at all post rotate [19:31:26] logrotate sends a HUP [19:31:39] so logs just... stopped getting written?
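
[Editor's note: the reopen-on-HUP contract bd808 describes above goes like this: logrotate renames the output files and then HUPs udp2log, and the demuxer is expected to close and reopen its handles so new writes land in the freshly created files. A minimal sketch of that pattern in Python, with illustrative names (not the real demux.py):]

    import signal

    BASEDIR = "/a/mw-log"  # illustrative path
    files = {}             # log name -> open file handle

    def reopen_on_hup(signum, frame):
        # After logrotate renames the old files, close and reopen every
        # handle at the original path; failing to do this (or failing
        # silently) leaves all writes going nowhere, which is the symptom
        # being debugged here.
        for name, fh in list(files.items()):
            fh.close()
            files[name] = open(BASEDIR + "/" + name, "a")

    signal.signal(signal.SIGHUP, reopen_on_hup)
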
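[Editor's note: the long zcat wait is inherent to gzip; the format has no random access, so reaching the last line means decompressing the whole ~29 GB stream, which is exactly what zcat | tail does. The same single pass in Python, with an illustrative path:]

    import gzip

    # Stream-decompress and keep only the final line.
    last = None
    with gzip.open("/a/mw-log/archive/api.log-20160831.gz", "rt",
                   errors="replace") as fh:
        for line in fh:
            last = line
    print(last)
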
[19:31:43] and there were hudreds of zombie children of the udp2log proc before apergos restarted it [19:31:44] drwxr-xr-x 4 udp2log udp2log 138 Aug 31 09:42 /a/mw-log [19:32:02] bad perms? [19:32:05] ^^ [19:32:15] owned/writeable by udp2log [19:32:25] /a/mw-log/archive is Aug 31 9:43 [19:34:06] Aug 30 19:48 /usr/local/bin/demux.py [19:34:18] ooohhhh [19:34:21] there's a good clue [19:34:29] * de93e39 - Remove the hard-coded /a/mw-log references scattered around everywhere (24 hours ago) [19:34:31] in puppet [19:34:35] oh... dear [19:34:43] I just noticed something about one of my changes in that file [19:34:48] args.baseDir [19:34:50] not args.basedir [19:34:59] so maybe https://gerrit.wikimedia.org/r/#/c/305767/ ? [19:35:00] python is picky like that [19:35:13] (03PS1) 10Urbanecm: Update frwikt logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307809 (https://phabricator.wikimedia.org/T144427) [19:35:27] :D [19:36:03] it's probably writing to '/' + name [19:36:06] or wishing it could [19:36:33] 06Operations: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389#2599583 (10hashar) [19:37:04] nope [19:37:18] think it'll just be crashing? [19:37:38] 06Operations: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389#2598103 (10hashar) After a bunch of investigation with @bd808 @ArielGlenn @Krenair : Aug 30 19:48 /usr/local/bin/demux.py Most probably caused by de93e39b978ae213c61b0ef76ac14e0240158110 [19:37:53] (03CR) 10Odder: "Yes, the domain has been transferred to the Foundation on 23 November 2015." [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [19:37:58] I just made the edit on deployment-fluorine02, which I just noticed had the same issue [19:38:00] yep it probably exceptions out, well I didn't look at the demux vode [19:38:01] it works again there [19:38:02] code [19:38:03] I'll submit a patch [19:38:04] (03PS7) 10Dzahn: Phab: Remove config abstraction. Useless & confusing [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [19:38:07] (03PS1) 10Ppchelko: Ignore 503 from ORES updates. [puppet] - 10https://gerrit.wikimedia.org/r/307810 [19:38:14] but it ought to really complain visibly instead of silently dying [19:38:21] anyhoo [19:38:25] I told them this would happen... [19:39:31] well I guess it tries to write to /hhvm.log , get a perm deny. That raise an exception which is caught. It then happily pass [19:39:54] going to prepare the group1 while the puppet is fixed [19:39:58] yay for pass instead of at least logging that in the udp2log error log >_< [19:40:49] apergos, it just does except: pass ? [19:41:18] see hashar's comment [19:41:41] more or less [19:41:52] attempt to close the open FD if there is any matching [19:42:01] but in this case there is no FD open since it got denied [19:42:04] so it ... pass :] [19:42:15] that is resilient. [19:42:17] yes that's it [19:42:22] (03PS1) 10Alex Monk: Follow-up I6df802b9: Fix udp2log's demux.py [puppet] - 10https://gerrit.wikimedia.org/r/307812 (https://phabricator.wikimedia.org/T144389) [19:42:28] while true... do some stuff.. if there's an exception close the file... 
move on [19:42:32] (03CR) 10Dzahn: [C: 032] "no-op in compiler http://puppet-compiler.wmflabs.org/3908/" [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [19:43:21] Krenair: do you have +2 in puppet, I forget [19:43:37] ie should I +1 or +2 it [19:44:00] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307813 (owner: 10Hashar) [19:44:15] I could change the ACLs so that I did, but it'd be fairly pointless since I can't actually log into the production puppetmaster and pull there [19:44:25] ok. I'll just +2 merge deploy [19:44:43] (03PS2) 10ArielGlenn: Follow-up I6df802b9: Fix udp2log's demux.py [puppet] - 10https://gerrit.wikimedia.org/r/307812 (https://phabricator.wikimedia.org/T144389) (owner: 10Alex Monk) [19:44:50] not sure whether puppet will trigger a restart [19:44:53] (03CR) 10Dzahn: "yep, nothing happened on iridium (prod) either. puppet ran twice." [puppet] - 10https://gerrit.wikimedia.org/r/307354 (https://phabricator.wikimedia.org/T144112) (owner: 10Chad) [19:44:57] you might have to restart udp2log-mw manually [19:45:25] (03CR) 10Hashar: [C: 031] Follow-up I6df802b9: Fix udp2log's demux.py [puppet] - 10https://gerrit.wikimedia.org/r/307812 (https://phabricator.wikimedia.org/T144389) (owner: 10Alex Monk) [19:45:50] switching group1 [19:46:03] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.17 [19:46:12] (03CR) 10ArielGlenn: [C: 032] Follow-up I6df802b9: Fix udp2log's demux.py [puppet] - 10https://gerrit.wikimedia.org/r/307812 (https://phabricator.wikimedia.org/T144389) (owner: 10Alex Monk) [19:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:29] bd808: thank you :] [19:46:31] grrr [19:46:37] didn't wait for jenkins [19:47:08] bd808: you might know how on logstash GUI one can refresh data of a dashboard ? [19:47:23] bd808: there used to be a tiny refresh button, looks like we have to refresh the whole page now [19:47:51] (03PS3) 10Dzahn: contint: fix resource conflict with service::deploy::common [puppet] - 10https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: 10BryanDavis) [19:47:59] (03CR) 10Dzahn: [C: 032] contint: fix resource conflict with service::deploy::common [puppet] - 10https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: 10BryanDavis) [19:48:20] (03PS1) 10Urbanecm: HD logos for frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307814 (https://phabricator.wikimedia.org/T144427) [19:50:12] !log restarted udp2log-mw on fluorine again after deployment of https://gerrit.wikimedia.org/r/#/c/307812/ [19:50:19] all looking good now [19:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:40] thanks Krenair [19:54:35] I'm going to check out, if you need other root help someone in an sf-like tz will have to take it up from here. 
[19:54:37] happy trails [19:54:44] bd808, been having trouble getting jouncebot back in here [19:54:55] it runs on the bastion, but not via the grid [19:55:02] ImportError: /usr/lib/x86_64-linux-gnu/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /data/project/jouncebot/virtenv/local/lib/python2.7/site-packages/lxml/etree.so) [19:55:56] oh btw bd808: 2016-08-31 06:25:23 [stuff] mw1228 hewiki 1.28.0-wmf.16 api INFO: API GET, heh [19:56:01] (tail) [19:58:44] Krenair: I've seen that in the error logs too, but thought it was still running? [19:59:08] Well, if I run it from the bastion: [19:59:10] hashar: click the little magnifying glass to refresh [19:59:11] !log group1 to 1.28.0-wmf.17 done. There is a couple explicit commit of implicit transactions for Wikidata T144433 T144434 not much of a worry [19:59:13] T144433: Wikibase\SqlIdGenerator::generateNewId: Implicit transaction expected (DBO_TRX set). - https://phabricator.wikimedia.org/T144433 [19:59:13] T144434: Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages: Implicit transaction already active (from DatabaseBase::query (Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsages)). - https://phabricator.wikimedia.org/T144434 [19:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:33] bd808: oh just search again. Neat! thank you a ton [19:59:37] Krenair: that import error means you're using a venv from precise on trusty or vice versa [19:59:44] ^ [20:00:01] (03PS7) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [20:00:09] ooh [20:00:11] ah. I have no idea which the venv is built for. [20:00:16] the bastion is trusty [20:00:16] (03CR) 10jenkins-bot: [V: 04-1] abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) (owner: 10ArielGlenn) [20:00:17] but [20:00:23] it was running on a precise exec node [20:00:29] apergos: Krenair: thank you for the fluorine fix up [20:00:50] Krenair: will let you claim/solve https://phabricator.wikimedia.org/T144389 :] [20:01:14] Krenair: just use the run.sh script to start it and it sets the right things [20:01:18] 06Operations, 13Patch-For-Review: /a/mw-log/hhvm.log: file not found on Fluorine - https://phabricator.wikimedia.org/T144389#2599736 (10AlexMonk-WMF) 05Open>03Resolved a:03AlexMonk-WMF This was my fault, mostly. Fixed now. 
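
[Editor's note: a retrospective sketch pulling the diagnosis above together; this illustrates the failure mode and is not the real demux.py. An argparse attribute referenced with the wrong case raises AttributeError on every write, and a blanket except swallows it after closing any matching handle, so the process stays up while writing nothing and logging nothing:]

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--basedir", default="/a/mw-log")
    args = parser.parse_args([])

    files = {}

    def write_line(name, line):
        try:
            base = args.baseDir  # AttributeError: the option was defined as 'basedir'
            if name not in files:
                files[name] = open(base + "/" + name, "a")
            files[name].write(line)
        except Exception:
            # the "resilient" anti-pattern called out above: close any
            # matching open handle and carry on without a trace
            fh = files.pop(name, None)
            if fh is not None:
                fh.close()

    write_line("hhvm.log", "this line is silently dropped\n")
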
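[Editor's note: the libxml2 ImportError above is the symptom bd808 names: a virtualenv's compiled extensions (lxml here) link against the shared libraries of the distro they were built on, so a venv built on precise breaks on trusty and vice versa. A hypothetical startup guard, assuming the release the venv was built for is stamped somewhere:]

    import platform
    import sys

    EXPECTED_RELEASE = "trusty"  # assumed stamp recorded when the venv was built

    # platform.linux_distribution() returns (name, version, codename) on the
    # Python 2.7 these bots ran on.
    codename = platform.linux_distribution()[2]
    if codename != EXPECTED_RELEASE:
        sys.exit("venv built on %s but running on %s; rebuild the venv"
                 % (EXPECTED_RELEASE, codename))
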
[20:01:26] sigh [20:01:35] Okay [20:01:55] tools.jouncebot@tools-bastion-03:~$ grep jsub start_jouncebot.sh [20:01:55] # TO START JOUNCEBOT: jsub -once -continuous -mem 512m -N jouncebot start_jouncebot.sh [20:02:05] ^ That's what I was doing [20:02:11] Correct one: [20:02:12] tools.jouncebot@tools-bastion-03:~$ cat run.sh [20:02:12] #!/usr/bin/env bash [20:02:12] jsub -once -continuous -mem 512m -l release=trusty -N jouncebot start_jouncebot.sh [20:02:27] I'll update start_jouncebot.sh's comment [20:03:22] (03PS8) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [20:04:40] (03CR) 10Dzahn: "merged on prod master now" [puppet] - 10https://gerrit.wikimedia.org/r/307561 (https://phabricator.wikimedia.org/T143065) (owner: 10BryanDavis) [20:05:13] now it's running on a trusty node [20:05:30] success [20:05:31] jouncebot, help [20:05:36] jouncebot, next [20:05:36] In 2 hour(s) and 54 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T2300) [20:05:59] is that with my latest patch Krenair? [20:06:04] yes bd808 [20:06:10] back to no cherry-picks then :) [20:07:10] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2599777 (10MoritzMuehlenhoff) Thanks, I'll try an install on mw2148. If it doesn't work, we can narrow that down further with a third host. [20:07:27] bd808, oh, hang on [20:07:34] it seems to have an old version of your commit [20:07:54] there was a cherry-pick of it [20:08:24] git reset --hard origin/master should be safe [20:08:51] !sal [20:08:51] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [20:08:54] @seen Ladsgroup [20:08:54] mutante: I have never seen Ladsgroup [20:08:58] jouncebot, next [20:08:58] In 2 hour(s) and 51 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T2300) [20:09:15] releng got added to tools.jouncebot yesterday so we can eventually restart it in case it explodes [20:10:25] I need to make a couple more patches to it, but I'm going to test them out on stashbot first [20:12:01] (03CR) 10Dzahn: "What exactly is the "careful" part when merging this in "So it's okay to merge this but we should be careful we don't start the apocalyps" [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup) [20:12:33] "should be ok, just dont start the apocalypse", hehe [20:12:40] but how :) [20:21:34] ottomata: any chance you could test the new archiva server? 
[20:22:06] have never used it in any way, but i got it to run and have a login [20:22:18] unsure what a good test even is [20:23:21] or anyone who has used archiva.wm.org [20:24:06] testing is possible by editing /etc/resolv.conf so that archiva.wm.org is 208.80.154.73 [20:25:43] hmm, aye lemme see hmm [20:25:56] i _could_ add archiva-new.wm [20:26:02] but that might also mean new trouble [20:26:09] ottomata: thanks [20:28:09] (03CR) 10Gehel: [C: 04-1] Discovery stats module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) (owner: 10MaxSem) [20:29:25] hmm mutante its missing everything [20:29:47] looks like a brand new install almost [20:29:53] our configured repos aren't there [20:29:58] nor are any of the artifacts [20:30:08] was there a java database thing not brought over? [20:31:34] (03PS6) 10MaxSem: Discovery stats module [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) [20:31:41] gehel, ^ [20:33:24] ottomata: i copied the entire /var/lib/archiva, but that's all i did [20:33:30] and the new config back in place [20:33:57] then i had to fix some permissions because the UID is different [20:34:05] then i could restart the service and the login popped up [20:34:44] hm, yeah, var/lib/archiva does have the derby databases [20:35:00] hm, that's just users [20:35:01] hm [20:35:15] (03CR) 10Ladsgroup: "The reason I'm worried about is that I don't know what kind of changes it triggers if we merge this. Will this clean up the whole /srv/dep" [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup) [20:35:26] moritz was hoping this is all too [20:38:35] ottomata: so the config.xml file, that isnt generated by puppet but by the web admin ui,, i am told.. right [20:38:58] (03CR) 10Hashar: "> That's more research/surgery than I'm interested in doing at this point. Could be done in a follow up patch by someone, but really I thi" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/307333 (https://phabricator.wikimedia.org/T144189) (owner: 10BryanDavis) [20:39:05] ottomata: that is the only file now that is not an exact copy of the one on old server [20:40:13] ottomata: because i copied that to /tmp, rsynced all, copied that one file back in place [20:41:18] ottomata: maybe either we try the original config from the old version.. or maybe in that admin webui some pathes have to be re-configured or so? [20:41:34] looking [20:41:37] i have to run really soon btw [20:41:39] but am looking [20:42:02] yes, ok, thanks. can also continue later [20:43:19] mutante: /etc/archiva/jetty.xml is in puppet [20:43:21] the rest is not [20:43:32] the rest of the config is done via ui, and saved to an xml file in /var/lib/archiva [20:43:49] /var/lib/archiva/conf/archiva.xml [20:43:57] how do I log into the new host? [20:43:57] that is the file i was talking about, yes [20:44:11] i did not want to overwrite that one [20:44:16] but mayeb i should have [20:44:30] mutante: ah yes, that one should def match what was on titanium [20:44:40] ottomata: meitnerium.wikimedia.org [20:45:12] mutante: do I need to go through proxy? [20:45:15] iron or whatever? [20:45:16] ottomata: ok, let me sync that .. 
i was thinking it comes from the package and should be the jessie one [20:45:22] ottomata: no, you can go directly [20:45:22] ahh ok [20:45:24] no, wait [20:45:24] i betcha that will do it [20:45:25] you cant [20:45:28] because of ferm [20:45:49] you still have to go through bastion [20:45:54] even though it's a 28. IP [20:45:56] 208 [20:45:57] cool am in through bastion [20:46:11] yeah, mutante def sync that file and restart archiva [20:46:15] i betcha that will fix it [20:46:20] ok, hold on [20:49:20] done [20:49:31] of course right then everything slows down for me locally [20:50:13] ottomata: better now? [20:50:46] looking [20:51:17] so far so good: [20:51:18] Downloaded: https://archiva.wikimedia.org/repository/mirrored/org/apache/hadoop/hadoop-common/2.6.0-cdh5.5.2/hadoop-common-2.6.0-cdh5.5.2.jar (3312 KB at 1768.1 KB/sec) [20:51:29] logging in to look [20:51:43] its taking much longer to log in (like it used to) so that's a good sign too i guess [20:52:03] what do you use to login there? [20:52:31] is that LDAP? [20:53:16] nice that it looks good:) [20:53:38] no not ldap [20:53:56] would be great if it was, i tried that when i initially set this up years ago, but there was a bug that kept it from working [20:54:22] mutante: i'll write an email to analytics team and ask them to try to use this for a bit [20:54:30] ottomata: sounds great, cool [20:54:42] if they can do a full analytics/refinery release with this, then we should use it [20:55:07] ottomata: good enough with the /etc/resolv.conf for them to test? [20:55:22] yeah [20:55:42] great [20:56:11] thanks, let's see later then [20:56:22] good progress [20:57:18] cool, thanks, gotta run now, nice work! [20:57:43] :) laters [20:57:57] oh mutante :] [20:58:05] I have found an oddity in puppet : modules/role/manifests/mha/node.pp:class mha::node { [20:58:17] shouldn't it be class role::mha::node ? [20:58:39] eg the role:: prefix got dropped when mha got moved under module/role maybe? was b4973a4 [20:58:43] yes, it should be [20:58:48] there are many things like that [20:58:54] hashar, https://gerrit.wikimedia.org/r/#/c/301076/ [20:59:05] trying to fix them is hard, there is always resistance to change them around [20:59:14] you have to find out which instances use them [20:59:18] and reconfigure them [20:59:30] ahh jynus nicely solves it :] [20:59:36] rm is a magic tool for sure [20:59:38] oh :) [21:00:06] -8318 hah, nice [21:00:12] what a cleanup [21:01:54] solved thank you :] [21:17:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5056089 keys - replication_delay is 633 [21:19:16] jouncebot: next [21:19:16] In 1 hour(s) and 40 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T2300) [21:23:01] Evening SWAT? [21:24:27] Yep :) [21:25:54] whose evening? [21:26:28] LOL [21:26:35] It is almost 11pm here [21:27:20] does someone need to worry about the redis whine above?
[21:27:30] as it's midnight-30 here I am not that someone [21:27:52] !log T143226: Stopping restbase2001-a.codfw.wmnet to clear tables marked repaired [21:27:53] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:15] LOL [21:28:46] apergos your two hours a head, that makes your 00:28am [21:28:54] !!log T143226: Starting restbase2001-a.codfw.wmnet [21:28:54] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:28:56] uh huh [21:38:01] !log T143226: Stopping restbase2001-{a,b} to clear tables marked repaired [21:38:02] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:38:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5018082 keys - replication_delay is 0 [21:39:53] !log T143226: Starting restbase2001-{b,c}.codfw.wmnet [21:39:54] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:33] (03PS1) 10Andrew Bogott: Further attempt to get the precise cloud-init working [puppet] - 10https://gerrit.wikimedia.org/r/307877 [21:44:45] !log T143226: Clearing repair status: restbase2002.codfw.wmnet [21:44:46] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:02] !log T143226: Clearing repair status: restbase2007.codfw.wmnet [21:46:03] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:47:02] (03CR) 10Andrew Bogott: [C: 032] Further attempt to get the precise cloud-init working [puppet] - 10https://gerrit.wikimedia.org/r/307877 (owner: 10Andrew Bogott) [21:51:59] !log T143226: Clearing repair status: restbase2003.codfw.wmnet [21:52:00] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [21:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:07] there is also a spam of " Retrying connection to search.svc.eqiad.wmnet " [21:53:38] urandom: could that restbase issue cause search.svc.eqiad.wmnet to have troubles? [21:53:50] hashar: restbase issue? [21:54:11] hashar: we're not currently having any issues that i'm aware of... [21:54:13] sorry I was refering to the repair status [21:54:19] that is unrelated apparently [21:54:20] hashar: oh. nope! 
[21:59:49] (03PS2) 10BBlack: openssl (1.1.0-1~wmf1) experimental; urgency=medium [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307466 [22:01:38] !log T143226: Clearing repair status: restbase2004.codfw.wmnet [22:01:39] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [22:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:09] !log surge of since 31th 7:00 UTC : Retrying connection to search.svc.eqiad.wmnet [22:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:24] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 07Wikimedia-log-errors: Elastica warning about Retrying connection to search.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T144450#2600157 (10hashar) [22:05:01] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2600178 (10AlexMonk-WMF) [22:05:04] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 15User-bd808: deployment-sca0[12] puppet failure due to issues involving /srv/deployment directory - https://phabricator.wikimedia.org/T143065#2600172 (10AlexMonk-WMF) 05Open>03Resolved a:03bd808 Thanks @bd808 and @dzahn [22:08:22] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2600186 (10madhuvishy) [22:08:24] 06Operations, 06Labs, 13Patch-For-Review: move nfs /scratch to labstore1003 - https://phabricator.wikimedia.org/T134896#2600184 (10madhuvishy) 05Open>03Resolved a:03madhuvishy [22:09:21] (03CR) 10Krinkle: [C: 031] "Did various greps in this repo and others, github searches, find/readlink/grep searches, and no matches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306471 (owner: 10Chad) [22:12:27] !log T143226: Clearing repair status: restbase2008.codfw.wmnet [22:12:28] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [22:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:13:15] 06Operations: Fwd: [WMF Office IT / Facilities] Re: Fwd: Help with Winedale mailing list - https://phabricator.wikimedia.org/T144416#2600207 (10Krenair) [22:16:20] 06Operations, 10Wikimedia-Mailing-lists: Fwd: [WMF Office IT / Facilities] Re: Fwd: Help with Winedale mailing list - https://phabricator.wikimedia.org/T144416#2600214 (10Dzahn) [22:17:05] !log restarting elasticsearch on elastic1028 [22:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:00] 06Operations, 10Wikimedia-Mailing-lists: Fwd: [WMF Office IT / Facilities] Re: Fwd: Help with Winedale mailing list - https://phabricator.wikimedia.org/T144416#2599105 (10Dzahn) Hi @KFrancis can you check if Mike still has access to mnemonic at gmail.com or if not what the current address is? [22:26:07] 06Operations, 10Wikimedia-Mailing-lists: Fwd: [WMF Office IT / Facilities] Re: Fwd: Help with Winedale mailing list - https://phabricator.wikimedia.org/T144416#2600304 (10KFrancis) Hi, His current address/access is: mnemonic@gmail.com. Thanks, Katie [22:26:59] Hi! I have a question: Is it possible to set up a permanent forwarding for board members? More specifically, once they leave. Stephen L. was hoping to have any email sent to their @wikimedia email forwarded to their personal email. He said that there should be around 0-3 a year. 
Usually OIT deletes them from Google, but would forwarding be possible even if [22:26:59] they are deleted from our google console? [22:28:07] josephine: those forwardings used to exist, at least [22:29:29] I think this is a question for OIT, actually [22:29:49] it's possible but at the same time we want to move those away from ops and to oit [22:29:52] they may need to keep a stub entry in the google console of something [22:29:53] which is another ticket [22:29:56] *or [22:29:58] it could be done in exim [22:30:07] but they are moving away from that: https://phabricator.wikimedia.org/T122144 [22:30:11] what he said [22:30:25] :) [22:30:27] ok thats good to know. thanks! I'll talk to my team about it [22:30:31] :) [22:30:32] can you maybe do the forwarding in google? [22:30:40] ok, cool [22:30:44] let us know [22:32:02] We can do forwarding in google, but we delete accounts after a certain period. I think we can just keep their account open on our end instead of having it thru exim. [22:32:26] that would be ideal for us [22:32:40] then we dont have to maintain the exim file anymore and you guys dont have to ask us each time [22:32:43] !log depooling elastic1047 from LVS [22:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:04] Ok, I'll go let OIT know! [22:33:18] (03PS1) 10Andrew Bogott: Cloud.cfg tidying for precise and trusty images [puppet] - 10https://gerrit.wikimedia.org/r/307881 [22:33:32] !log gehel@palladium conftool action : set/pooled=no; selector: name=elastic1047.eqiad.wmnet [22:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:33:49] (03PS9) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [22:34:12] (03PS3) 10BBlack: openssl (1.1.0-1~wmf1) experimental; urgency=medium [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307466 [22:34:14] (03PS2) 10BBlack: Add local chapoly preference hack patch [debs/openssl] (wmf-1.1) - 10https://gerrit.wikimedia.org/r/307465 [22:37:38] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600311 (10awight) a:03AndyRussG [22:45:02] (03PS10) 10ArielGlenn: abstract out code for adds/changes dumps generation, for general library [dumps] - 10https://gerrit.wikimedia.org/r/307257 (https://phabricator.wikimedia.org/T133547) [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160831T2300). Please do the needful. [23:00:04] Dereckson, Dereckson, and yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:02:32] Dereckson, are you multiplying? :P [23:03:12] guess I'll do the deploy [23:10:45] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600401 (10DStrine) [23:13:54] Hi. 
[23:14:48] I created two entries, one for my patch, one for urbanecm's patches [23:15:09] cheap clone method :) [23:15:43] !log maxsem@tin Synchronized php-1.28.0-wmf.17/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/307883/ (duration: 00m 49s) [23:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:05] 06Operations, 10Wikimedia-Mailing-lists: Fwd: [WMF Office IT / Facilities] Re: Fwd: Help with Winedale mailing list - https://phabricator.wikimedia.org/T144416#2600427 (10Dzahn) Hi @KFrancis you can tell him to check is inbox, he should received a new random password from mailman itself now at that address. He... [23:16:59] 06Operations, 10Wikimedia-Mailing-lists: reset password of winedale-l - https://phabricator.wikimedia.org/T144416#2600428 (10Dzahn) [23:20:15] !log T143226: Clearing repair status: codfw, rack 'd' [23:20:16] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [23:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:26] (03PS2) 10MaxSem: HD logos for frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307814 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:23:33] (03CR) 10MaxSem: [C: 032] HD logos for frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307814 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:24:02] (03Merged) 10jenkins-bot: HD logos for frwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307814 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:24:07] Sync 307809 at the same time perhaps, that will less confusing to test [23:24:17] (and we'll avoid to serve an HD version different than the LD) [23:25:15] (03PS2) 10MaxSem: Update frwikt logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307809 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:25:20] (03CR) 10MaxSem: [C: 032] Update frwikt logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307809 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:25:50] (03Merged) 10jenkins-bot: Update frwikt logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307809 (https://phabricator.wikimedia.org/T144427) (owner: 10Urbanecm) [23:26:50] Dereckson, pulled on mw1099 [23:28:32] Testing. [23:29:39] https://fr.wiktionary.org/static/images/project-logos/frwiktionary.png works. 
[23:30:00] HD works too [23:31:41] !log maxsem@tin Synchronized static: Update logos, step 1 (duration: 00m 47s) [23:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:57] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: Updating logos, step 2 (duration: 00m 48s) [23:34:01] !log T143226: Clearing repair status: eqiad, rack 'a' nodes [23:34:03] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [23:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:13] Dereckson, ^ [23:35:45] (03PS4) 10MaxSem: Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:36:16] no notice in server log [23:36:18] that looks good [23:36:50] yep, looks good to me [23:37:25] (03CR) 10MaxSem: [C: 032] Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:38:01] (03Merged) 10jenkins-bot: Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:38:34] Dereckson, pulled on mw1099 [23:38:41] testing [23:39:19] looks good to me [23:40:51] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/268627/ (duration: 00m 47s) [23:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:29] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/268627/ (duration: 00m 47s) [23:42:36] (03PS1) 10Dereckson: Use extension registration for CategoryTree (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307886 [23:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:38] Dereckson, [23:42:44] yep? [23:44:12] ^^^ :P [23:44:48] So. Currently, we've successfully switched CategoryTree to use registration, without breaking prod. [23:45:16] change has wg (for wfExtensionLoad) wmg (for old CS) variables [23:45:23] https://gerrit.wikimedia.org/r/307886 gets rid of the wmg [23:49:40] MaxSem: ^ [23:50:03] (03CR) 10MaxSem: [C: 032] Use extension registration for CategoryTree (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307886 (owner: 10Dereckson) [23:50:29] (03Merged) 10jenkins-bot: Use extension registration for CategoryTree (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307886 (owner: 10Dereckson) [23:51:20] Dereckson, pulled on mw1099 [23:51:59] still looks good to me [23:52:10] suspicious... [23:53:45] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/307886/ (duration: 00m 48s) [23:53:46] The goal was to stop to flood server log with notices and missing variables to make migrations smoother. [23:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:26] That seems to work well.