[12:21:34] (PS1) Alexandros Kosiaris: Renumber planet1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/360836
[12:27:25] (CR) Ema: [C: 1] thumbor: use jessie-backports as target release for python-thumbor-wikimedia [puppet] - https://gerrit.wikimedia.org/r/360350 (https://phabricator.wikimedia.org/T121388) (owner: Filippo Giunchedi)
[12:43:25] Operations, Patch-For-Review, User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3370245 (fgiunchedi) thanks @JAllemandou ! I've converted the tables to use parquet and dropped the old plaintext tables
[12:47:10] (PS2) Filippo Giunchedi: thumbor: use jessie-backports as target release for python-thumbor-wikimedia [puppet] - https://gerrit.wikimedia.org/r/360350 (https://phabricator.wikimedia.org/T121388)
[12:49:23] (CR) Filippo Giunchedi: [C: 2] thumbor: use jessie-backports as target release for python-thumbor-wikimedia [puppet] - https://gerrit.wikimedia.org/r/360350 (https://phabricator.wikimedia.org/T121388) (owner: Filippo Giunchedi)
[12:56:47] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.54 seconds
[12:57:07] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.22 seconds
[12:57:07] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.31 seconds
[12:57:17] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.01 seconds
[12:57:27] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 653.23 seconds
[12:57:56] Operations, DBA, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#3370277 (jcrespo) @faidon unless I am mistaken, this task you asked me to do some time ago is already fixed on puppet- "only" thing pending is to reimage the whole fleet i...
[12:59:48] That is icinga losing downtimes, as I downtimed those for a week yesterday jynus akosiaris ^
[13:00:02] jouncebot: next
[13:00:02] In 2 hour(s) and 59 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1600)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1300). Please do the needful.
[13:00:04] schana: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:11] I'm here
[13:00:18] o/
[13:01:03] I have absolutely zero clue how quicksurvey work :/
[13:01:14] marostegui: jynus aha.. so it's not related to the host. I am happy about that, means I am not crazy
[13:02:01] the plot thickens
[13:02:09] hashar: what do you need to know?
[13:02:12] (CR) Hashar: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/359936 (https://phabricator.wikimedia.org/T131949) (owner: Nschaaf)
[13:02:23] what can potentially break :]
[13:02:34] I am going to deploy it on mwdebug1001
[13:02:55] okay
[13:03:14] [Wed Jun 21 13:22:33 2017] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;db2054;1498051352;1498656152;1;0;604800;marvin-bot;T166208
[13:03:14] [Wed Jun 21 13:22:33 2017] EXTERNAL COMMAND: SCHEDULE_HOST_SVC_DOWNTIME;db2054;1498051352;1498656152;1;0;604800;marvin-bot;T166208
[13:03:15] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208
[13:03:26] I guest that would be you marostegui
[13:03:54] Yesp
[13:03:56] Yep
[13:04:27] (Merged) jenkins-bot: Enable Reader Survey using QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/359936 (https://phabricator.wikimedia.org/T131949) (owner: Nschaaf)
[13:04:34] up to Wednesday the 28th
[13:04:47] schana: it is on mwdebug1001 now
[13:04:54] testing
[13:05:44] (CR) jenkins-bot: Enable Reader Survey using QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/359936 (https://phabricator.wikimedia.org/T131949) (owner: Nschaaf)
[13:06:43] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3370304 (akosiaris) This just happened. db2054 just alerted. Looking at the icinga logs a schedule downtime for both and services was set f...
[13:06:48] Operations, Tracking: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494#3366319 (jcrespo) > I doubt this tracking bug is going to be particularly useful I agree with moritz for the following technical reasons: * No one has agreed to get rid of jessies. While it may be the nat...
[13:07:47] akosiaris: Yep, I did: -d 604800
[13:07:56] hashar: looks good
[13:08:27] marostegui: also a few minutes ago ?
[13:08:29] [Thu Jun 22 13:00:51 2017] EXTERNAL COMMAND: SCHEDULE_HOST_SVC_DOWNTIME;db2054;1498136451;1498741251;1;0;604800;marvin-bot;
[13:08:30] lets unleashing it
[13:08:36] my english ....
[13:08:53] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225440 (Marostegui) For the record, this is what I ran to downtime the hosts above: ``` icinga-downtime -d 604800 -h $i -r 'T166208' ``` (...
[13:09:03] akosiaris: hehe yes :)
[13:09:30] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable Reader Survey using QuickSurveys - T131949 (duration: 01m 04s)
[13:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:38] T131949: [Epic] Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949
[13:09:49] schana: should be in production now :]
[13:09:57] awesome, thanks!
[13:11:17] ok this is weird
[13:12:07] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[13:12:14] I can clearly see for example that today on 02:04:41 db2054 s7 lag check tripped 10 times and want to a HARD critical but icinga did not emit (as it should). Then on 12:56 today it did
[13:12:45] and yet the main process runs since Jun 14
[13:12:58] Operations: Deploy a freenode server (again) - https://phabricator.wikimedia.org/T168579#3370318 (Luke081515) @RobH Said he want to bring that to the weekly ops meeting. Can we wait for that, or is that definitive decision @faidon ?
[13:13:30] The only thing I can think of is that it doesn't cope well with flaps, as I am sure it has been flapping as in: critical-recover-critical-recover…as it is executing big alters coming directly from the master
[13:13:39] so it gets delayed, then recovers, then another alter arrives, so delayed again etc
[13:15:50] yeah it doesn't look like it's state loss via the file or something
[13:16:07] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=74%)
[13:16:10] the retention file would have been synced multiple times in the meantime
[13:17:00] (PS1) Jcrespo: mariadb: Initial stretch support for wmf package with systemd [puppet] - https://gerrit.wikimedia.org/r/360845 (https://phabricator.wikimedia.org/T168356)
[13:18:07] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[13:19:47] PROBLEM - DPKG on labtestnet2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:21:47] RECOVERY - DPKG on labtestnet2001 is OK: All packages OK
[13:21:51] !log rebooting restbase2006 for kernel update
[13:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:09] marostegui: I am updating https://phabricator.wikimedia.org/T164206#3370304 with whatever I find but this looks like a bug in icinga and I have no idea how to even reproduce it
[13:29:04] My main question is…why some downtimes are lost and not some others?
[13:29:10] like: why not all at once?
[13:29:51] (PS2) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor [dns] - https://gerrit.wikimedia.org/r/360826 (https://phabricator.wikimedia.org/T167801)
[13:31:16] could it be that it's just expiring them faster than it should ?
[13:32:07] nope
[13:32:14] an exit from scheduled downtime is logged
[13:32:20] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6834/" [puppet] - https://gerrit.wikimedia.org/r/356198 (https://phabricator.wikimedia.org/T151648) (owner: Filippo Giunchedi)
[13:32:20] I see nothing for db2054
[13:34:46] an exit? like it has expired you mean?
[13:36:13] yes
[13:36:18] the relevant log line is
[13:36:39] [1498132825] SERVICE DOWNTIME ALERT: ganeti1005;Check systemd state;STOPPED; Service has exited from a period of scheduled downtime
[13:36:48] hence me saying "exit"
[13:36:57] enter is the reverse term
[13:37:23] there's no "exit" for db2054 in the logs
[13:37:36] Operations, ops-eqiad, Services (watching): scb1003 unresponsive after reboot - https://phabricator.wikimedia.org/T168534#3370362 (Cmjohnson) Removed power, drained flea power and powered back on
[13:38:02] Operations, ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3370366 (Cmjohnson) Ordered new disk through dell
[13:39:28] This is a silly question but…there is no limit on the amount of downtimes we can have right? just thinking in a FIFO if we have a limit
[13:39:39] Just throwing that random crazy idea here
[13:39:41] just in case
[13:40:43] note that I know of
[13:40:46] not*
[13:40:53] (PS3) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor [dns] - https://gerrit.wikimedia.org/r/360826 (https://phabricator.wikimedia.org/T167801)
[13:41:38] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3370374 (jcrespo) I normally do not try to thought potential random things ,but I installed db2072 a few hours ago- could that have causes s...
[13:42:07] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:43:24] (CR) Giuseppe Lavagetto: [C: 1] Repurpose four mw machines in codfw for thumbor [dns] - https://gerrit.wikimedia.org/r/360826 (https://phabricator.wikimedia.org/T167801) (owner: Filippo Giunchedi)
[13:43:47] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3370379 (akosiaris) Nope, no restart. icinga is running since Jun 14. Reloads I am pretty sure happen all the time so if a reload is involve...
[13:44:22] (PS1) Ottomata: Fix CLI opt in sqoop_mediawiki.pp cron [puppet] - https://gerrit.wikimedia.org/r/360846
[13:44:46] (CR) Ottomata: [V: 2 C: 2] Fix CLI opt in sqoop_mediawiki.pp cron [puppet] - https://gerrit.wikimedia.org/r/360846 (owner: Ottomata)
[13:45:53] !log reboot planet1001 for kernel upgrades and renumbering
[13:46:01] (CR) Alexandros Kosiaris: [C: 2] Renumber planet1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/360836 (owner: Alexandros Kosiaris)
[13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:51] (PS1) Ottomata: hiera-ize a log retention parameters for analytisc kafka brokers [puppet] - https://gerrit.wikimedia.org/r/360847
[13:50:01] hashar: is it possible to revert the quicksurvey change from the swat?
[13:51:22] (CR) Hashar: "role::ci::slave::labs is solely for the legacy permanent slaves. I dont think we will have any using stretch + hhvm, the new target is to" [puppet] - https://gerrit.wikimedia.org/r/359492 (https://phabricator.wikimedia.org/T166611) (owner: Paladox)
[13:51:59] (PS2) Ottomata: hiera-ize a log retention parameters for analytisc kafka brokers [puppet] - https://gerrit.wikimedia.org/r/360847
[13:52:04] (CR) Ottomata: "No op https://puppet-compiler.wmflabs.org/6836/" [puppet] - https://gerrit.wikimedia.org/r/360847 (owner: Ottomata)
[13:52:09] (CR) Ottomata: [V: 2 C: 2] hiera-ize a log retention parameters for analytisc kafka brokers [puppet] - https://gerrit.wikimedia.org/r/360847 (owner: Ottomata)
[13:52:48] (PS3) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor [puppet] - https://gerrit.wikimedia.org/r/360827 (https://phabricator.wikimedia.org/T167801)
[13:55:54] !log restart wdqs servers for kernel upgrade
[13:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:11] (PS2) Jcrespo: mariadb: Initial stretch support for wmf package with systemd [puppet] - https://gerrit.wikimedia.org/r/360845 (https://phabricator.wikimedia.org/T168356)
[13:58:13] (CR) Giuseppe Lavagetto: [C: 1] Repurpose four mw machines in codfw for thumbor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/360827 (https://phabricator.wikimedia.org/T167801) (owner: Filippo Giunchedi)
[14:02:14] (CR) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/360827 (https://phabricator.wikimedia.org/T167801) (owner: Filippo Giunchedi)
[14:02:19] (PS4) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor [puppet] - https://gerrit.wikimedia.org/r/360827 (https://phabricator.wikimedia.org/T167801)
[14:05:58] (PS4) Filippo Giunchedi: Repurpose four mw machines in codfw for thumbor [dns] - https://gerrit.wikimedia.org/r/360826 (https://phabricator.wikimedia.org/T167801)
[14:06:29] (PS1) Nschaaf: Stop reader surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/360849 (https://phabricator.wikimedia.org/T131949)
[14:06:36] (CR) Filippo Giunchedi: [C: 2] Repurpose four mw machines in codfw for thumbor [dns] - https://gerrit.wikimedia.org/r/360826 (https://phabricator.wikimedia.org/T167801) (owner: Filippo Giunchedi)
[14:07:28] is there anyone available to help with stopping currently running surveys? https://phabricator.wikimedia.org/T131949
[14:07:39] (CR) Filippo Giunchedi: [C: 2] Repurpose four mw machines in codfw for thumbor [puppet] - https://gerrit.wikimedia.org/r/360827 (https://phabricator.wikimedia.org/T167801) (owner: Filippo Giunchedi)
[14:07:40] they were deployed an hour ago during SWAT
[14:08:50] (CR) Hashar: [C: 1] "Looks good to me. Thank you Paladox." [puppet] - https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: Paladox)
[14:10:24] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3370504 (jcrespo) AFAIK, icinga seems to be using /var/lib/icinga/retention.dat as the persistance. There is no mention of db2054 there, exc...
[14:11:37] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[14:12:29] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3370540 (jcrespo) The only think I could think for is to enable the debug log, what do you think?
[14:13:22] PROBLEM - Host thumbor.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:38] sigh, sorry
[14:13:51] that's me
[14:18:32] RECOVERY - Host thumbor.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms
[14:18:44] (CR) Jcrespo: [C: 2] mariadb: Initial stretch support for wmf package with systemd [puppet] - https://gerrit.wikimedia.org/r/360845 (https://phabricator.wikimedia.org/T168356) (owner: Jcrespo)
[14:18:51] (PS3) Jcrespo: mariadb: Initial stretch support for wmf package with systemd [puppet] - https://gerrit.wikimedia.org/r/360845 (https://phabricator.wikimedia.org/T168356)
[14:22:13] !log restart wdqs servers completed
[14:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:26] Operations, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3370653 (awight)
[14:37:38] !log restarting maps-test cluster for kernel upgrade
[14:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:17] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)
[14:43:50] (PS1) Jcrespo: mariadb: Remove installation of percona-xtrabackup on stretch [puppet] - https://gerrit.wikimedia.org/r/360859 (https://phabricator.wikimedia.org/T168356)
[14:46:20] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2077295
[14:51:48] (CR) Herron: "Evaluating x-spam-score seems like a reasonable approach to me. However, is the description accurate? My understanding is that bounce_ma" [puppet] - https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: Nemo bis)
[14:52:39] Operations: Deploy a freenode server (again) - https://phabricator.wikimedia.org/T168579#3370730 (RobH) I planned to to bring it to the ops meeting for Faidon and Mark to review. If they've done so before the meeting, then there seems to be little point in bringing it to said meeting.
[14:55:44] Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3370746 (Papaul) a:Papaul>Andrew
[14:57:16] (PS2) Jcrespo: mariadb: Remove installation of percona-xtrabackup on stretch [puppet] - https://gerrit.wikimedia.org/r/360859 (https://phabricator.wikimedia.org/T168356)
[15:04:04] (CR) Jcrespo: [C: 2] mariadb: Remove installation of percona-xtrabackup on stretch [puppet] - https://gerrit.wikimedia.org/r/360859 (https://phabricator.wikimedia.org/T168356) (owner: Jcrespo)
[15:20:22] Operations, DBA, Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3370853 (jcrespo) The currrent package works ok, but there are 2 little details, which I am not 100% sure if to solve as config defaults or as packa...
[15:21:44] !log rebooting restbase2005 for kernel update
[15:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:50] PROBLEM - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:23:53] Operations, DBA, Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3370861 (Marostegui) +1 to place it on config rather than in package. I would prefer to be able to configure it thru puppet instead of hardcoding it...
[15:24:30] PROBLEM - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.46 and port 9042: Connection refused
[15:25:30] PROBLEM - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:25:40] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused
[15:26:10] PROBLEM - trendingedits endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:30] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused
[15:27:31] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:29:10] PROBLEM - Host scb1003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:33] Operations, ops-codfw, Patch-For-Review: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3370879 (Papaul) Please see below for network port information Row A rack A6 wmf6575 ge-6/0/3 Row C rack C1 wmf6576 ge-1/0/15 Row D rack D8 wmf6577 ge-8/0/2
[15:30:50] Operations, ops-codfw, Patch-For-Review: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3370880 (Papaul)
[15:31:20] Operations, ops-codfw, Patch-For-Review: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3341661 (Papaul) Open>Resolved setup complete.
[15:31:23] ^expired downtime, fixing
[15:31:37] !log otto@tin Started deploy [eventlogging/analytics@328dea6]: inserting eventlogging events into mysql based on topic name if it exists, falling back to schema name
[15:31:41] !log otto@tin Finished deploy [eventlogging/analytics@328dea6]: inserting eventlogging events into mysql based on topic name if it exists, falling back to schema name (duration: 00m 03s)
[15:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:53] Operations, ops-codfw, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3370884 (Papaul) Disk wipe complete.
[15:33:22] Operations, ops-codfw, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3370885 (Papaul)
[15:34:00] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3370887 (Ladsgroup)
[15:34:15] Operations, ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3370900 (Marostegui) Is there anything pending here?
[15:35:23] (PS1) Filippo Giunchedi: hieradata: add codfw thumbor config [puppet] - https://gerrit.wikimedia.org/r/360865
[15:35:24] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3364601 (Ladsgroup) @akosiaris: Hey, @awight is joing Scoring platform team, do you think this needs to go through normal access requests perio...
[15:36:10] RECOVERY - Host scb1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:36:30] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.48 port 9042
[15:37:40] (PS1) Paladox: test [debs/kafka] - https://gerrit.wikimedia.org/r/360867
[15:37:50] (Abandoned) Paladox: test [debs/kafka] - https://gerrit.wikimedia.org/r/360867 (owner: Paladox)
[15:38:30] PROBLEM - dhclient process on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:38:31] PROBLEM - DPKG on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:38:31] PROBLEM - SSH on elastic1039 is CRITICAL: connect to address 10.64.16.48 and port 22: Connection refused
[15:38:40] PROBLEM - configured eth on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:38:40] PROBLEM - Elasticsearch HTTPS on elastic1039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:38:40] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:38:49] gehel: ---^ :)
[15:38:50] PROBLEM - MD RAID on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:38:55] yep, on it...
[15:39:00] RECOVERY - trendingedits endpoints health on scb1003 is OK: All endpoints are healthy
[15:39:00] PROBLEM - salt-minion processes on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:39:10] PROBLEM - Check size of conntrack table on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:39:10] PROBLEM - Check systemd state on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:39:20] PROBLEM - Disk space on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:39:20] PROBLEM - puppet last run on elastic1039 is CRITICAL: Return code of 255 is out of bounds
[15:39:31] elukey: thanks!
[15:40:46] (PS2) Filippo Giunchedi: hieradata: add codfw thumbor config [puppet] - https://gerrit.wikimedia.org/r/360865
[15:41:13] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/6838/ (fails with thumbor2001 because the master hasn't compiled its catalog yet)" [puppet] - https://gerrit.wikimedia.org/r/360865 (owner: Filippo Giunchedi)
[15:42:24] Operations, ops-codfw: db2035 needs firmware upgrade - https://phabricator.wikimedia.org/T167125#3370934 (jcrespo) Open>Resolved a:jcrespo>Papaul
[15:42:30] PROBLEM - Host elastic1039 is DOWN: PING CRITICAL - Packet loss = 100%
[15:45:10] RECOVERY - Check size of conntrack table on elastic1039 is OK: OK: nf_conntrack is 0 % full
[15:45:10] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational
[15:45:20] RECOVERY - Elasticsearch HTTPS on elastic1039 is OK: SSL OK - Certificate elastic1039.eqiad.wmnet valid until 2022-02-21 10:38:53 +0000 (expires in 1704 days)
[15:45:20] RECOVERY - Host elastic1039 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:45:20] RECOVERY - Disk space on elastic1039 is OK: DISK OK
[15:45:21] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[15:45:30] RECOVERY - dhclient process on elastic1039 is OK: PROCS OK: 0 processes with command name dhclient
[15:45:40] RECOVERY - SSH on elastic1039 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[15:45:40] RECOVERY - DPKG on elastic1039 is OK: All packages OK
[15:45:40] RECOVERY - configured eth on elastic1039 is OK: OK - interfaces up
[15:45:40] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1039 is OK: OK ferm input default policy is set
[15:45:50] RECOVERY - MD RAID on elastic1039 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[15:46:00] RECOVERY - salt-minion processes on elastic1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:46:02] !log repooling scb1003 after hardware maintenance
[15:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:45] (CR) Filippo Giunchedi: [C: 2] hieradata: add codfw thumbor config [puppet] - https://gerrit.wikimedia.org/r/360865 (owner: Filippo Giunchedi)
[15:49:37] Operations, DBA, Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3370951 (jcrespo)
[15:54:20] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1294.eqiad.wmnet because of too many down!: rendering_80 - Could not depool server mw1294.eqiad.wmnet because of too many down!
[15:54:21] PROBLEM - Apache HTTP on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:30] PROBLEM - Apache HTTP on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:40] PROBLEM - Apache HTTP on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:51] PROBLEM - Nginx local proxy to apache on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:51] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1294.eqiad.wmnet because of too many down!: rendering-https_443 - Could not depool server mw1294.eqiad.wmnet because of too many down!
[15:55:00] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - rendering_80 - Could not depool server mw1294.eqiad.wmnet because of too many down!: rendering-https_443 - Could not depool server mw1294.eqiad.wmnet because of too many down!
[15:55:00] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:00] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1294.eqiad.wmnet because of too many down!: rendering_80 - Could not depool server mw1294.eqiad.wmnet because of too many down!
[15:55:10] PROBLEM - Nginx local proxy to apache on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:30] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:31] RECOVERY - Apache HTTP on mw1295 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 5.363 second response time
[15:55:33] woa
[15:55:55] moritzm: reboots?
[15:56:01] RECOVERY - Nginx local proxy to apache on mw1295 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.056 second response time
[15:56:01] PROBLEM - Nginx local proxy to apache on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:56:08] no
[15:56:10] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:56:20] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 73395 bytes in 0.104 second response time
[15:56:47] no current reboots, having a look
[15:57:10] Operations, Labs, Labs-Infrastructure, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3370989 (Andrew) We're close to setting up new puppetmaster hardware, as per T167905. So I'm going to let this slide in ho...
[15:57:28] Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3349071 (Andrew)
[15:57:31] Operations, Labs, Labs-Infrastructure, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3370991 (Andrew)
[15:57:33] 5xx growing
[15:59:07] <_joe_> no, I guess it's overload
[15:59:09] only upload, I cannot see any other thing
[15:59:24] <_joe_> jynus: that's related to the imagescalers crashes here
[15:59:37] Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3371009 (Andrew) p:Normal>High @Cmjohnson, @RobH, the cert for the existing puppetmaster is expiring on July 15th, so I'd like to move ever...
[15:59:40] yeah, I saw the renders, but I wanted to check anyway
[15:59:50] RECOVERY - Nginx local proxy to apache on mw1293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 4.303 second response time
[16:00:00] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 73396 bytes in 8.500 second response time
[16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1600). Please do the needful.
[16:00:07] the spike is not too large
[16:00:21] RECOVERY - Apache HTTP on mw1293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.053 second response time
[16:00:22] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[16:00:30] RECOVERY - Apache HTTP on mw1294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 4.500 second response time
[16:00:32] <_joe_> jynus: it seems just overload
[16:00:38] Operations, Labs, Labs-Infrastructure, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3371028 (Andrew) Open>stalled
[16:00:42] although I see a spike on total requests
[16:00:47] yes
[16:00:50] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[16:01:00] RECOVERY - Nginx local proxy to apache on mw1294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.055 second response time
[16:01:00] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 73396 bytes in 0.116 second response time
[16:01:00] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[16:01:00] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[16:01:39] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=imagescaler&var-instance=All see the "hhvm load" panel
[16:02:04] <_joe_> and the apache workers
[16:02:18] <_joe_> we had a flood of requests for rendering that arrived all of a sudden
[16:02:48] it doesn't seem worse than others in the past week though (at least from the graphs)
[16:03:22] <_joe_> maybe this was more localized
[16:04:10] (CR) Andrew Bogott: [C: 2] Remove an extraneous character in nodnsupdate. [puppet] - https://gerrit.wikimedia.org/r/360699 (owner: Andrew Bogott)
[16:04:15] Operations, ops-eqiad, Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371048 (Cmjohnson) port assignments 1001 A6 ge-6/0/10 1002 A7 ge-7/0/23 1003 b7 ge-7/0/10 1004 B8 ge-8/0/10 1005 C3 ge-3/0/15 1006 C4 ge-4/0/13 1007 D3 ge-3/0/36 1008 D4 ge-4/...
[16:04:17] (PS2) Andrew Bogott: Remove an extraneous character in nodnsupdate. [puppet] - https://gerrit.wikimedia.org/r/360699
[16:04:20] PROBLEM - Check whether ferm is active by checking the default input chain on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:20] PROBLEM - Check size of conntrack table on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:20] PROBLEM - salt-minion processes on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:21] PROBLEM - puppet last run on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:30] PROBLEM - Apache HTTP on mw2148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:30] PROBLEM - Nginx local proxy to apache on mw2148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:30] PROBLEM - HHVM rendering on mw2148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:40] PROBLEM - nutcracker port on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:40] PROBLEM - Check systemd state on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:40] PROBLEM - DPKG on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:50] PROBLEM - configured eth on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:50] PROBLEM - dhclient process on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:51] PROBLEM - HHVM processes on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:05:00] PROBLEM - nutcracker process on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:00] PROBLEM - Disk space on mw2148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:20] looking [16:05:32] that, one, however, looks like a server down or a network issue [16:06:28] that's one of the repurposed hosts [16:06:29] <_joe_> that's a server being reimaged by godog [16:07:03] sigh, expired downtime, I thought it would be gone by now [16:07:11] sorry about that, I'll extend it [16:10:46] 10Operations, 10ops-eqiad, 10Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371064 (10Cmjohnson) [16:11:36] RECOVERY - DPKG on mw2148 is OK: All packages OK [16:11:46] RECOVERY - configured eth on mw2148 is OK: OK - interfaces up [16:11:46] RECOVERY - dhclient process on mw2148 is OK: PROCS OK: 0 processes with command name dhclient [16:11:47] RECOVERY - Disk space on mw2148 is OK: DISK OK [16:12:07] RECOVERY - Check whether ferm is active by checking the default input chain on mw2148 is OK: OK ferm input default policy is set [16:12:16] RECOVERY - Check size of conntrack table on mw2148 is OK: OK: nf_conntrack is 0 % full [16:12:16] RECOVERY - salt-minion processes on mw2148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:12:38] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3371070 (10Papaul) switch port information Row A ms-be2001 - ge-1/0/1 ms-be2002 - ge-3/0/40 ms-be2003 - ge-4/0/40 ms-be2004 - ge-5/0/18 R... [16:13:35] 10Operations, 10Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3371101 (10Dworley) @RobH: I've signed the L3 document. Where should I send the other requested details? 
[16:13:46] PROBLEM - nutcracker port on thumbor2001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [16:13:56] PROBLEM - nutcracker process on thumbor2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [16:14:10] 10Operations, 10Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3371102 (10RobH) Just here on the task is fine (all of it becomes public when its input into our user files) [16:14:16] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Puppet has 44 failures. Last run 2 minutes ago with 44 failures. Failed resources (up to 3 shown): Service[nutcracker],Exec[create-tmp-folder-8821],Exec[create-tmp-folder-8810],Exec[create-tmp-folder-8819] [16:15:26] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:47] (03PS1) 10Andrew Bogott: Graphite labs archiver: Only save metrics for a year. [puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) [16:20:14] (03CR) 10jerkins-bot: [V: 04-1] Graphite labs archiver: Only save metrics for a year. [puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) (owner: 10Andrew Bogott) [16:20:18] 10Operations, 10Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3371117 (10Dworley) Wikitech User: dworley Preferred Shell User: dworley Public Key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDlyZ7OS3200wsY9HIv2i0FlN3tbUBvzXL/AqvGXADDZscRK3DOU... [16:20:46] (03PS2) 10Andrew Bogott: Graphite labs archiver: Only save metrics for a year. [puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) [16:23:46] (03CR) 10jerkins-bot: [V: 04-1] Graphite labs archiver: Only save metrics for a year. 
[puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) (owner: 10Andrew Bogott) [16:24:09] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3371135 (10Andrew) [16:24:57] (03PS3) 10Andrew Bogott: Graphite labs archiver: Only save metrics for a year. [puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) [16:25:50] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10Andrew) [16:25:51] I'm shutting stashbot down to keep it from flapping while I upgrade the Elasticsearch cluster in Toolforge [16:26:16] PROBLEM - thumbor@8816 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8816 is inactive [16:26:36] PROBLEM - thumbor@8834 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8834 is inactive [16:26:43] !log rebooting labsdb1006 as per T168584 [16:26:56] PROBLEM - thumbor@8835 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is inactive [16:27:06] PROBLEM - thumbor@8819 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is inactive [16:27:46] PROBLEM - thumbor@8821 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8821 is inactive [16:27:46] PROBLEM - thumbor@8838 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8838 is inactive [16:28:06] PROBLEM - thumbor@8805 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is inactive [16:28:06] PROBLEM - thumbor@8839 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8839 is inactive [16:28:16] PROBLEM - thumbor@8840 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is inactive [16:28:16] PROBLEM - 
thumbor@8806 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8806 is inactive [16:28:17] PROBLEM - thumbor@8823 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is inactive [16:28:36] RECOVERY - thumbor@8834 service on thumbor2001 is OK: OK - thumbor@8834 is active [16:28:46] RECOVERY - thumbor@8821 service on thumbor2001 is OK: OK - thumbor@8821 is active [16:28:46] RECOVERY - thumbor@8838 service on thumbor2001 is OK: OK - thumbor@8838 is active [16:28:53] (03CR) 10Andrew Bogott: [C: 032] Graphite labs archiver: Only save metrics for a year. [puppet] - 10https://gerrit.wikimedia.org/r/360869 (https://phabricator.wikimedia.org/T168344) (owner: 10Andrew Bogott) [16:28:56] RECOVERY - thumbor@8835 service on thumbor2001 is OK: OK - thumbor@8835 is active [16:29:06] RECOVERY - thumbor@8805 service on thumbor2001 is OK: OK - thumbor@8805 is active [16:29:06] RECOVERY - thumbor@8839 service on thumbor2001 is OK: OK - thumbor@8839 is active [16:29:07] RECOVERY - thumbor@8819 service on thumbor2001 is OK: OK - thumbor@8819 is active [16:29:16] RECOVERY - thumbor@8840 service on thumbor2001 is OK: OK - thumbor@8840 is active [16:29:17] RECOVERY - thumbor@8816 service on thumbor2001 is OK: OK - thumbor@8816 is active [16:29:17] RECOVERY - thumbor@8806 service on thumbor2001 is OK: OK - thumbor@8806 is active [16:29:17] RECOVERY - thumbor@8823 service on thumbor2001 is OK: OK - thumbor@8823 is active [16:43:49] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3371277 (10RobH) I've confirmed that these are having issues with jessie booting faster than the disks can spin up. On many of them (wtp1027 as an example) won't quite get the disks spun u... 
[16:45:32] (03CR) 10Dzahn: [C: 031] "per all the comments from hashar and paladox above, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [16:46:19] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 8155 [16:46:31] (03PS1) 10Cmjohnson: Adding mgmt entries for labpuppetmasters T165531 [dns] - 10https://gerrit.wikimedia.org/r/360874 [16:48:11] (03CR) 10Cmjohnson: [C: 032] Adding mgmt entries for labpuppetmasters T165531 [dns] - 10https://gerrit.wikimedia.org/r/360874 (owner: 10Cmjohnson) [16:56:24] 10Operations, 10netops: Temperature increase in esams - https://phabricator.wikimedia.org/T168667#3371317 (10ayounsi) [16:58:30] (03PS1) 10RobH: adding rootdelay to jessie installs [puppet] - 10https://gerrit.wikimedia.org/r/360876 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1700). [17:01:15] (03PS2) 10Framawiki: Planet-fr: Replace the RAW feed by the new one [puppet] - 10https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) [17:01:55] (03CR) 10Framawiki: "No." 
[puppet] - 10https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [17:02:24] no parsoid deploy today [17:03:20] !log rebooting labsdb1007 [17:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:24] !log Log events between 15:46 and 17:03 missed due to stashbot downtime [17:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:19] andrewbogott: looks like that ^ only included your db1006 log [17:05:41] !log rebooting labsdb1007 [17:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:52] bd808: thanks [17:06:00] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3371358 (10RobH) https://gerrit.wikimedia.org/r/#/c/360874/ [17:06:29] ACKNOWLEDGEMENT - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR Ayounsi T86541 [17:08:21] 10Operations, 10DBA, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3371363 (10Andrew) [17:09:47] gehel: most of the transition steps for logstash/plugins should be https://phabricator.wikimedia.org/T165748#3369000 [17:10:12] thcipriani: yep, I'm looking at this [17:10:51] (03CR) 10Gehel: [C: 032] "looks good, all the pieces are ready to merge and deploy" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:10:56] (03CR) 10Gehel: [V: 032 C: 032] Deployment via scap3 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:13:25] (03PS3) 10Gehel: Scap3: deploy logstash/plugins with scap3 [puppet] - 
10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:13:59] !log disable puppet on db2062 before maintenance [17:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:32] !log moving to scap for logstash plugin deployment [17:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:48] (03CR) 10Gehel: [C: 032] Scap3: deploy logstash/plugins with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [17:18:29] gehel: just realized step 4 was a bit ambiguous, scap deploy --init should be run in /srv/deployment/logstash/plugins on tin (on tin being the part I left out :)) [17:18:50] ah, but I see you've run it :) [17:18:55] thcipriani: yep, that's what I understood [17:19:11] and I should probably run it on naos as well, just to make sure things are ready if / when we switch [17:19:22] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3371397 (10Cmjohnson) Connected labpuppetmaster1001 b8 ge-8/0/11 labpuppetmaster1002 d6 ge-6/0/1 [17:19:42] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3371398 (10Cmjohnson) [17:20:01] actually, no, there is a lock to prevent running scap on naos, so probably not something I should do. 
[17:20:23] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3371405 (10Papaul) [17:20:40] gehel: it runs as part of every deployment, I just like to make sure that .git/DEPLOY_HEAD is present before puppet is run on targets initially, probably not even strictly necessary, but a good "just in case" thing [17:20:54] ok, understood [17:21:07] since puppet on targets will run scap if the repo doesn't exist at the right location (it shouldn't actually run in this instance) [17:22:43] ok, /srv/deployment/logstash is now owned by deploy-service, let's try to deploy for real [17:22:46] 10Operations, 10ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3371415 (10Cmjohnson) Dell got back to me and disks are not included in the warranty. I have a 500GB disks here and can replace it [17:22:56] ok, watching scap deploy-log :) [17:23:04] !log gehel@tin Started deploy [logstash/plugins@720b648]: (no justification provided) [17:23:05] (03PS3) 10Dzahn: Planet-fr: Replace the RAW feed by the new one [puppet] - 10https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [17:23:07] !log gehel@tin Finished deploy [logstash/plugins@720b648]: (no justification provided) (duration: 00m 02s) [17:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:24] (03CR) 10Dzahn: [C: 032] "thanks! 
got it" [puppet] - 10https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [17:23:26] (03CR) 10Bearloga: [C: 031] git::clone - ensure => latest should also work with non default branch [puppet] - 10https://gerrit.wikimedia.org/r/360685 (owner: 10Gehel) [17:23:28] (03PS1) 10RobH: setting production dns for ores100[1-9] [dns] - 10https://gerrit.wikimedia.org/r/360880 [17:23:58] huh, that was fast, but all the logs seem to think everything went ok [17:24:26] !log restarting logstash on logstash1001 to validate plugin deplyoment with scap3 [17:24:28] (03CR) 10RobH: [C: 032] setting production dns for ores100[1-9] [dns] - 10https://gerrit.wikimedia.org/r/360880 (owner: 10RobH) [17:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:40] does /srv/deployment/logstash/plugins-cache/revs/720b64805cb0a1228b2dfa3247444b80d0029aee exist on targets and symlinked in /srv/deployment/logstash/plugins? [17:25:18] yes, it does [17:25:47] neat :) [17:26:06] logstash1001 restarted just fine, we can call this a success [17:26:18] thcipriani: thanks for the help [17:26:28] 10Operations, 10ops-eqiad, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371441 (10RobH) a:05Cmjohnson>03RobH Chris has done all the on-site required steps, stealing for remote accessible steps/remainder. [17:26:37] gehel: \o/ awesome, thank you for merging and verifying [17:26:42] one less thing :) [17:27:00] and now time to replicate that for elasticsearch plugins... 
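The check asked about above (does the revision directory exist under plugins-cache/revs, and is the deploy path a symlink to it?) can be sketched in a few lines. This is only an illustration, assuming the layout mentioned in the conversation; the helper name `is_rev_deployed` is hypothetical and is not part of scap.

```python
import os


def is_rev_deployed(deploy_path, revs_dir, rev):
    """Return True if revs_dir/rev exists and deploy_path is a symlink to it.

    Hypothetical helper: mirrors the manual check from the conversation,
    it is not a scap command or API.
    """
    rev_dir = os.path.join(revs_dir, rev)
    if not os.path.isdir(rev_dir):
        # The revision was never fetched onto this target.
        return False
    if not os.path.islink(deploy_path):
        # The deploy path is a plain directory (or missing), not a symlink.
        return False
    # Compare fully resolved paths so relative symlinks are handled too.
    return os.path.realpath(deploy_path) == os.path.realpath(rev_dir)


if __name__ == "__main__":
    # Paths and revision as quoted in the conversation.
    print(is_rev_deployed(
        "/srv/deployment/logstash/plugins",
        "/srv/deployment/logstash/plugins-cache/revs",
        "720b64805cb0a1228b2dfa3247444b80d0029aee",
    ))
```

Run on a target host, this would print True only when both conditions from the question hold; on any other machine it simply prints False.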
[17:27:09] :) [17:31:36] !log bsitzmann@tin Started deploy [mobileapps/deploy@7bfe571]: Update mobileapps to 21f771d [17:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:24] (03PS1) 10Jcrespo: mariadb-core: Test systemd and stretch support [puppet] - 10https://gerrit.wikimedia.org/r/360883 (https://phabricator.wikimedia.org/T168356) [17:34:11] (03PS2) 10Jcrespo: mariadb-core: Test systemd and stretch support [puppet] - 10https://gerrit.wikimedia.org/r/360883 (https://phabricator.wikimedia.org/T168356) [17:34:30] !log bsitzmann@tin Finished deploy [mobileapps/deploy@7bfe571]: Update mobileapps to 21f771d (duration: 02m 54s) [17:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:07] 10Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3371467 (10RobH) 05stalled>03Resolved Setup is being handled via T165171. [17:35:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb-core: Test systemd and stretch support [puppet] - 10https://gerrit.wikimedia.org/r/360883 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [17:37:09] 10Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3371477 (10kaldari) No worries. The purpose of this table was fulfilled years ago. It is safe to burn with fire. [17:37:43] 10Operations, 10Gerrit, 10Release-Engineering-Team: Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3371480 (10Paladox) [17:39:56] 10Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3371497 (10kaldari) No worries. The purpose of this table was fulfilled years ago. It is safe to burn with fire. 
[17:42:22] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3371503 (10Ottomata) 05Open>03Resolved [17:43:40] (03PS3) 10Jcrespo: mariadb-core: Test systemd and stretch support [puppet] - 10https://gerrit.wikimedia.org/r/360883 (https://phabricator.wikimedia.org/T168356) [17:43:59] 10Operations, 10ops-eqiad, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371519 (10RobH) [17:48:20] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/6839/" [puppet] - 10https://gerrit.wikimedia.org/r/360883 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [17:51:23] !log testing in-place upgrade from jessie to stretch of db2062 [17:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:49] 10Operations, 10Gerrit, 10Release-Engineering-Team: Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3371559 (10Paladox) p:05Triage>03High Setting high as this needs to be done to fix T152640 and prevent it returning in any future release. [17:55:55] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 12 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3371576 (10AndyRussG) The most recent version of the [[ https://gerrit.wikimedia.org/r/#/c/336237 | change in Gerrit ]]... [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1800). Please do the needful. [18:00:05] schana, Jdlrobson, ebernhardson, and framawiki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [18:00:10] o/ [18:00:22] hello [18:00:30] \o [18:01:20] (03PS1) 10Ladsgroup: Add /data/ Redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) [18:01:57] here [18:02:38] (03PS1) 10RobH: setting install params for ores100[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/360890 [18:02:43] I can SWAT [18:03:34] (03CR) 10RobH: [C: 032] setting install params for ores100[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/360890 (owner: 10RobH) [18:05:45] jdlrobson: hrm, so this change is a little tricky to deploy without spiking the error rate since I can't guarantee the order in which IS.php or CS.php will land on the server but I can say it won't be instantaneous [18:05:46] (03PS1) 10Ladsgroup: Add /data/ url redirect in beta cluster (Wikipedia only) [puppet] - 10https://gerrit.wikimedia.org/r/360891 (https://phabricator.wikimedia.org/T163922) [18:07:02] jdlrobson: any chance I could get you to modify https://gerrit.wikimedia.org/r/#/c/360166/2/wmf-config/InitialiseSettings.php to add wgRelatedArticlesEnabledBucketSize and not simultaneously remove wgRelatedArticlesEnabledSamplingRate and then make a followup patch that removes wgRelatedArticlesEnabledSamplingRate ? 
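The ordering concern raised above can be made concrete with a small model. The sketch below is a simplification, not the real wmf-config code: it treats InitialiseSettings.php as a dictionary of settings and CommonSettings.php as a function that reads one key, and assumes a server may briefly hold any mix of old and new files during a sync. The two key names come from the conversation; `VALUE` and all function names are hypothetical.

```python
from itertools import product

OLD_KEY = "wgRelatedArticlesEnabledSamplingRate"
NEW_KEY = "wgRelatedArticlesEnabledBucketSize"
VALUE = 0.1  # hypothetical value, for illustration only


# Stand-ins for three versions of the settings file.
def settings_old():
    return {OLD_KEY: VALUE}


def settings_both():
    return {OLD_KEY: VALUE, NEW_KEY: VALUE}


def settings_new():
    return {NEW_KEY: VALUE}


# Stand-ins for two versions of the file that reads the setting.
def reader_old(settings):
    return settings[OLD_KEY]


def reader_new(settings):
    return settings[NEW_KEY]


def safe(settings_versions, reader_versions):
    """True if every settings/reader pair that can coexist mid-sync works."""
    for make_settings, reader in product(settings_versions, reader_versions):
        try:
            reader(make_settings())
        except KeyError:
            return False
    return True


# Renaming in both files at once: some servers see a mismatched pair.
one_step = safe([settings_old, settings_new], [reader_old, reader_new])

# Staged rollout: each step changes only one file, and the settings file
# carries both keys while the reader is being switched.
staged = all([
    safe([settings_old, settings_both], [reader_old]),  # add the new key
    safe([settings_both], [reader_old, reader_new]),    # switch the reader
    safe([settings_both, settings_new], [reader_new]),  # drop the old key
])

if __name__ == "__main__":
    print("one-step rename safe:", one_step)
    print("staged rollout safe:", staged)
```

Under these assumptions the one-step rename fails for at least one file combination, while every intermediate state of the staged rollout is readable, which is the reason for asking to add the new variable first and remove the old one in a follow-up patch.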
[18:07:47] thcipriani: not sure i undertand [18:08:05] The code currently uses wgRelatedArticlesEnabledBucketSize [18:08:11] this is just a precaution for another patch [18:08:16] so i dont think it should spike errors [18:08:43] thcipriani: but yeh feel free to edit anyway that makes sense [18:08:45] oh, I didn't see wgRelatedArticlesEnabledBucketSize being set anywhere [18:09:09] thcipriani: oh my bad [18:09:14] no i meant wgRelatedArticlesEnabledSamplingRate [18:09:22] "code currently uses wgRelatedArticlesEnabledSamplingRate " [18:10:33] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3371629 (10Cmjohnson) 2nd ethernet connection...not setup on switch yet Labvirt1015 2/0/21 Labvirt1016 3/0/12 labvirt1017. 7/0/11 Labvirt1018 8/0/13 [18:10:41] jdlrobson: ah, yeah, ok, so the problem is that scap has no guarantee of the order in which IS.php and CS.php arrive on each server, so there's no way to deploy this patch where $wgRelatedArticlesLoggingSamplingRate will always be defined afaict [18:13:01] since if I deploy IS.php first $wgRelatedArticlesEnabledSamplingRate becomes undefined and if I deploy CS.php first wgRelatedArticlesEnabledBucketSize will be undefined [18:13:31] lemme make a quick tweak and have you check it out [18:13:32] ok thcipriani: we can just copy/pasta the value in InitialiseSettings if that's easier [18:13:40] im gonna remove it next week anyhow [18:13:52] yup, that was going to be my suggestion :) [18:13:56] I am going to restart db2062 [18:14:19] * thcipriani makes quick change [18:14:29] I cannot say some logs with errors will not be created- but not shown to users [18:15:51] 10Operations, 10ops-eqiad, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371634 (10RobH) [18:17:04] (03PS3) 10Thcipriani: relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [18:17:52] ^ jdlrobson does that tweak look fine/can I get a +1 then I'll roll out? [18:17:57] looking [18:18:29] i dont think https://gerrit.wikimedia.org/r/#/c/360166/3/wmf-config/CommonSettings.php,unified is needed any more? [18:18:38] og wait.. ignore me [18:19:29] (03CR) 10Jdlrobson: [C: 031] relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [18:20:06] cool, thanks [18:20:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [18:21:37] (03Merged) 10jenkins-bot: relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [18:21:46] (03CR) 10jenkins-bot: relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) (owner: 10Phuedx) [18:22:23] jdlrobson: live on mwdebug1002, check please [18:23:12] (sorry about the minor detour there, I have an ongoing battle with how scap plays with IS.php and CS.php :)) [18:23:59] !log restart db2062 [18:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:21] as I said, the spike of errors from db2062 / 10.192.16.195 is me [18:25:19] thcipriani: good good good [18:25:27] jdlrobson: ok, going live [18:27:13] (03PS1) 10Cmjohnson: adding mgmt dns entries for labcontrol1003 and 1004 v [dns] - 10https://gerrit.wikimedia.org/r/360892 [18:27:59] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:360166|relatedArticles: SamplingRate -> BucketSize]] PART I (duration: 00m 53s) [18:28:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:21] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:360166|relatedArticles: SamplingRate -> BucketSize]] PART II (duration: 00m 48s) [18:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:52] jdlrobson: ^ live [18:32:59] sorry lost internet there for a minute [18:33:00] thanks thcipriani checking again [18:34:37] I confirm log noise back to 0 [18:34:47] server finished putting up the services [18:36:12] ebernhardson: your change is live on mwdebug1002, check please [18:36:15] thcipriani: looking [18:37:02] thcipriani: looks all sane [18:37:09] ebernhardson: ok, going live [18:39:18] !log thcipriani@tin Synchronized php-1.30.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:360879|Switch to data-attribute for sister-search sidebar results]] T164854 (duration: 00m 50s) [18:39:27] ^ ebernhardson live now [18:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] T164854: Search Dashboard: update for engagement - sister projects - https://phabricator.wikimedia.org/T164854 [18:40:43] (03PS2) 10Thcipriani: Grant the 'movefile' right to the 'autopatrolled' group on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359792 (https://phabricator.wikimedia.org/T168192) (owner: 10Framawiki) [18:40:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359792 (https://phabricator.wikimedia.org/T168192) (owner: 10Framawiki) [18:41:51] (03Merged) 10jenkins-bot: Grant the 'movefile' right to the 'autopatrolled' group on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359792 (https://phabricator.wikimedia.org/T168192) (owner: 10Framawiki) [18:42:03] (03CR) 10jenkins-bot: Grant the 'movefile' right to the 'autopatrolled' group on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359792 
(https://phabricator.wikimedia.org/T168192) (owner: 10Framawiki) [18:42:09] schana: https://gerrit.wikimedia.org/r/#/c/360889/ is the kind of thing that l10nupdate will deploy automatically this evening, is there a reason that it needs to go sooner than that? [18:42:41] thcipriani: the surveys are running right now, and having non-translated values for the buttons could affect the responses [18:43:24] framawiki: https://gerrit.wikimedia.org/r/#/c/359792/2 is live on mwdebug1002 [18:44:19] thanks, i'm on it [18:44:44] (03PS1) 10Bearloga: statistics::packages: Add pandoc [puppet] - 10https://gerrit.wikimedia.org/r/360895 [18:47:10] schana: large l10nupdates aren't really things that should be swatted since they take quite a while and l10nupdate runs every night to pull in all the translate wiki updates, can this patch wait until this evening when it will be automatically deployed? [18:48:49] thcipriani: I think the concern is that we won't be able to use any of the collected data until after this patch is deployed [18:48:53] thcipriani: ok for this patch on debug [18:49:06] framawiki: ok, going live [18:50:39] schana: what data? [18:50:52] for the reader surveys [18:51:04] link to task incomming [18:51:09] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:359792|Grant the "movefile" right to the "autopatrolled" group on rowiki]] T168192 (duration: 00m 48s) [18:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:18] T168192: Grant the 'movefile' right to the 'autopatrolled' group on the Romanian Wikipedia - https://phabricator.wikimedia.org/T168192 [18:51:18] ^ framawiki live now [18:51:21] https://phabricator.wikimedia.org/T131949 [18:51:33] greg-g: ^ [18:52:06] * greg-g shrugs [18:52:07] thcipriani: confirmed, thanks [18:52:26] I'd prefer to just wait for the daily l10nupdate run for l10nupdates unless an emergency, is this an emergency or just nice to have? 
[18:52:44] I'm trying to clarify with Leila now [18:52:53] kk, thanks, I can't tell from the task :) [18:53:06] #wikimedia-research [18:53:58] greg-g: if it does go later tonight, I think we'd need to know what time the change occured [18:54:54] schana: it happens around 7pm pacific, but we can probably just do it now if it's easier [18:55:01] schana: I can deploy it now as part of SWAT, I don't want to invalidate responses because of poor translation, but I do take issue with making a survey live before there are translations [18:55:21] (03PS2) 10Thcipriani: Create a FeaturedFeed for the Wikimag bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359670 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [18:55:48] the survey had all the translations, but the QuickSurvey extension was missing them in a few languages. we didn't realize this until after turning it on [18:56:17] test first :) [18:56:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359670 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [18:57:24] (03Merged) 10jenkins-bot: Create a FeaturedFeed for the Wikimag bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359670 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [18:57:34] (03CR) 10jenkins-bot: Create a FeaturedFeed for the Wikimag bulletin on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359670 (https://phabricator.wikimedia.org/T168005) (owner: 10Framawiki) [18:58:22] framawiki: I'm not sure if there's anything to test for ^ but it's live on mwdebug1002 [18:59:25] thanks [18:59:31] (03PS4) 10Bearloga: Add info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) [18:59:53] yw, let me know if it looks good to sync everywhere :) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. 
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T1900).
[19:00:38] thcipriani: it's not very good as I can see. let me look more precisely what happens
[19:00:45] ok
[19:01:10] twentyafterfour: still swatting, have to run a full scap so it may take a minute
[19:02:16] oh, ok, thanks the cache, it's good now
[19:02:24] framawiki: ok, going live
[19:02:27] FYI: rss feed :) https://fr.wikipedia.org/w/api.php?action=featuredfeed&feed=wikimag&feedformat=atom
[19:04:05] Operations, Discovery, Maps, Interactive-Sprint, Patch-For-Review: Refactor maps puppet code to the role / profile paradigm - https://phabricator.wikimedia.org/T167871#3371771 (debt) Open>Resolved
[19:04:08] ah neat :)
[19:04:09] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100%
[19:04:20] Operations, Discovery, Epic, Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#3371774 (debt)
[19:04:23] Operations, Discovery, Interactive-Sprint, Maps (Maps-data): Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#3371772 (debt) Open>Resolved Yay!
[19:04:57] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:359670|Create a FeaturedFeed for the Wikimag bulletin on frwiki]] T168005 (duration: 00m 54s)
[19:05:03] ^ framawiki live
[19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:06] T168005: Create a FeaturedFeed for the frwiki Wikimag bulletin - https://phabricator.wikimedia.org/T168005
[19:06:09] yes, it's good, thanks !
[19:06:21] awesome, thanks for the patches :)
[19:06:35] (CR) Cmjohnson: [C: 2] adding mgmt dns entries for labcontrol1003 and 1004 v [dns] - https://gerrit.wikimedia.org/r/360892 (owner: Cmjohnson)
[19:06:48] thcipriani: no problem
[19:09:45] !log thcipriani@tin Started scap: SWAT: [[gerrit:360889|Translation updates for QuickSurveys]] T131949
[19:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:55] T131949: [Epic] Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949
[19:19:00] (CR) Jcrespo: [C: 2] mariadb: Improve systemd and package management [software] - https://gerrit.wikimedia.org/r/357626 (https://phabricator.wikimedia.org/T116903) (owner: Jcrespo)
[19:21:08] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3371867 (jcrespo)
[19:21:10] Operations, DBA, Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3371865 (jcrespo) Open>Resolved I am going to resolve this because technically, the package and systemd are correct- the things missing are...
[19:21:37] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3362038 (jcrespo)
[19:21:39] Operations, DBA, Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3371870 (jcrespo)
[19:22:09] Operations, DBA, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#3371873 (jcrespo) Open>stalled Blocked on full stretch migration.
[19:22:11] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3362038 (jcrespo)
[19:24:44] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3371880 (jcrespo) Aside from socket, datadir, basedir configurable on hiera: T148507 we need to create user systemd customizable templates (e.g. to increase the number of max file connections).
[19:25:28] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3371882 (jcrespo) a:jcrespo For now, db2062 and db2072 are stretch hosts with systemd-driven mysqls.
[19:27:42] Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3371884 (Cmjohnson)
[19:28:49] (PS1) Ottomata: Remove datasets.wikimedia.org site, redirect to analytics.wikimedia.org/datasets/archive [puppet] - https://gerrit.wikimedia.org/r/360900 (https://phabricator.wikimedia.org/T159409)
[19:29:37] (PS2) Ottomata: Remove datasets.wikimedia.org site, redirect to analytics.wikimedia.org/datasets/archive [puppet] - https://gerrit.wikimedia.org/r/360900 (https://phabricator.wikimedia.org/T159409)
[19:31:55] !log thcipriani@tin Finished scap: SWAT: [[gerrit:360889|Translation updates for QuickSurveys]] T131949 (duration: 22m 10s)
[19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:04] T131949: [Epic] Repeat the big English reader survey in one or two more languages - https://phabricator.wikimedia.org/T131949
[19:32:08] ^ schana translations should be live now
[19:32:21] twentyafterfour: sorry for the delay, train should be clear now
[19:32:58] ok
[19:33:01] thanks thcipriani
[19:33:07] although there is a new error that started creeping up during the last sync, is not l10n related, so I don't know why
[19:33:20] hmm
[19:33:39]
Operations, ops-eqiad, Scoring-platform-team, Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371911 (RobH)
[19:33:48] (CR) Ottomata: [C: 2] Remove datasets.wikimedia.org site, redirect to analytics.wikimedia.org/datasets/archive [puppet] - https://gerrit.wikimedia.org/r/360900 (https://phabricator.wikimedia.org/T159409) (owner: Ottomata)
[19:34:55] thanks thcipriani
[19:35:01] schana: yw :)
[19:35:21] twentyafterfour: definitely just started at 19:21 during the last scap sync, but no new code went out there, just l10n updates.
[19:37:39] PROBLEM - Disk space on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:37:49] PROBLEM - Check size of conntrack table on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:37:51] PROBLEM - MD RAID on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:37:59] PROBLEM - Elasticsearch HTTPS on elastic1041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:37:59] PROBLEM - configured eth on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:37:59] PROBLEM - puppet last run on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:19] PROBLEM - dhclient process on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:19] PROBLEM - DPKG on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:19] PROBLEM - Check systemd state on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:29] PROBLEM - salt-minion processes on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:29] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1041 is CRITICAL: Return code of 255 is out of bounds
[19:38:29] PROBLEM - SSH on elastic1041 is CRITICAL: connect to address 10.64.32.109 and port 22: Connection refused
[19:38:31] thcipriani: maybe a previously un-deployed patch merged by someone else?
[19:38:50] seems like that may be likely :(
[19:38:50] elastic1041 is me... looking
[19:40:05] Operations, ops-eqiad, Scoring-platform-team, Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371924 (RobH) ores100[789] are not showing active links on the switch. ge-3/0/36 up down ores1007 ge-4/0/33 up down ores1008 ge-6/0/0...
[19:41:20] PROBLEM - Host elastic1041 is DOWN: PING CRITICAL - Packet loss = 100%
[19:42:26] hrm, I don't see anything in the SAL, I just checked my scrollback and nothing came down with any git fetch I did as part of SWAT other than what was expected.
[19:42:37] strange
[19:42:41] which error is it?
[19:42:58] "Catchable fatal error: Argument 1 passed to DataValues\UnboundedQuantityValue::newFromArray() must be an instance of array, string given" ?
[19:43:02] ^
[19:45:39] RECOVERY - SSH on elastic1041 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[19:45:49] RECOVERY - Elasticsearch HTTPS on elastic1041 is OK: SSL OK - Certificate elastic1041.eqiad.wmnet valid until 2022-02-20 09:19:31 +0000 (expires in 1703 days)
[19:45:49] RECOVERY - Host elastic1041 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[19:45:49] RECOVERY - Check size of conntrack table on elastic1041 is OK: OK: nf_conntrack is 0 % full
[19:45:59] RECOVERY - MD RAID on elastic1041 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[19:45:59] RECOVERY - configured eth on elastic1041 is OK: OK - interfaces up
[19:45:59] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 49 minutes ago with 0 failures
[19:46:20] RECOVERY - dhclient process on elastic1041 is OK: PROCS OK: 0 processes with command name dhclient
[19:46:20] RECOVERY - Check systemd state on elastic1041 is OK: OK - running: The system is fully operational
[19:46:20] RECOVERY - DPKG on elastic1041 is OK: All packages OK
[19:46:29] RECOVERY - Check whether ferm is active by checking the default input chain on
elastic1041 is OK: OK ferm input default policy is set
[19:46:30] RECOVERY - salt-minion processes on elastic1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:46:30] wtf I don't get it
[19:46:39] RECOVERY - Disk space on elastic1041 is OK: DISK OK
[19:46:40] it's in vendor
[19:46:59] does something run composer automatically somewhere?
[19:48:36] not as far as I'm aware. It's Wikidata/vendor though...
[19:51:09] https://phabricator.wikimedia.org/T168681
[19:51:29] twentyafterfour: I'm going to disappear for a minute to get lunch, I didn't expect there to be so much swat runover :(
[19:51:41] thcipriani: no problem
[19:51:51] might ping aude or addshore for wikidata things
[19:52:03] I'm clueless at solving this one, I think we'll need help from one of them ^ indeed
[19:52:05] enjoy your lunch
[19:52:12] train: blocked
[19:52:27] !log the train is currently blocked by https://phabricator.wikimedia.org/T168681
[19:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:48] (PS2) Ottomata: statistics::packages: Add pandoc [puppet] - https://gerrit.wikimedia.org/r/360895 (owner: Bearloga)
[19:53:52] (CR) Ottomata: [V: 2 C: 2] statistics::packages: Add pandoc [puppet] - https://gerrit.wikimedia.org/r/360895 (owner: Bearloga)
[19:56:36] (CR) Andrew Bogott: [C: 1] git::clone - ensure => latest should also work with non default branch [puppet] - https://gerrit.wikimedia.org/r/360685 (owner: Gehel)
[19:58:59] (Draft1) Paladox: planet: Fix path to planet-wm2.png in puppet [puppet] - https://gerrit.wikimedia.org/r/360908
[19:59:01] (PS2) Paladox: planet: Fix path to planet-wm2.png in puppet [puppet] - https://gerrit.wikimedia.org/r/360908
[19:59:21] (PS3) Paladox: planet: Fix path to planet-wm2.png in puppet [puppet] - https://gerrit.wikimedia.org/r/360908
[20:02:05] (CR) Dzahn: [C: 2] planet: Fix path to planet-wm2.png in puppet [puppet] -
https://gerrit.wikimedia.org/r/360908 (owner: Paladox)
[20:02:12] thanks :)
[20:02:28] thx for fix
[20:05:42] (PS1) Dzahn: planet: make config templates and crons flexible for rawdog [puppet] - https://gerrit.wikimedia.org/r/360910 (https://phabricator.wikimedia.org/T168490)
[20:05:47] You're welcome :)
[20:06:49] (CR) Paladox: [C: 1] planet: make config templates and crons flexible for rawdog [puppet] - https://gerrit.wikimedia.org/r/360910 (https://phabricator.wikimedia.org/T168490) (owner: Dzahn)
[20:07:09] twentyafterfour: Lucas_WMDE thinks it might be him
[20:07:15] Operations, ops-eqiad, Scoring-platform-team, Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3371997 (RobH) >>! In T165171#3371924, @RobH wrote: > ores100[789] are not showing active links on the switch. > > ge-3/0/36 up down ores1007 >...
[20:07:32] twentyafterfour: have a full stacktrace for him?
[20:07:37] no
[20:07:44] I can try to find one
[20:08:05] ok, I have to run, be back online soon
[20:08:07] hm, on second thought… it’s not the same as the error I had in mind, though it looks similar
[20:09:02] are logs no longer on fluorine? /me can't log in there
[20:10:05] Operations, Ops-Access-Requests, Scoring-platform-team, User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3372003 (Zppix) >>! In T168014#3368117, @RobH wrote: > It seems this is awaiting followup from @zppix to work with @Halfak (as the employee who sponsored his a...
[20:10:07] Lucas_WMDE: the strange thing is that this showed up immediately after swat even though we didn't swat any patches which should have affected it
[20:10:18] the sync was supposed to be only l10n changes
[20:10:45] no idea how to actually reproduce it but it's showing up in logstash at a pretty high frequency
[20:11:50] (CR) Dzahn: adding rootdelay to jessie installs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/360876 (owner: RobH)
[20:12:57] (CR) RobH: adding rootdelay to jessie installs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/360876 (owner: RobH)
[20:14:24] (PS2) RobH: adding rootdelay to jessie installs [puppet] - https://gerrit.wikimedia.org/r/360876
[20:15:05] (PS3) RobH: adding rootdelay to jessie installs [puppet] - https://gerrit.wikimedia.org/r/360876
[20:21:05] !log labtestnet2001 turning neutron debug logs off because they're flooding the (very small) '/' partition
[20:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:26] I’ve looked into my code, and I don’t see how it could cause this error – I use UnboundedQuantityValue::newFromNumber, not newFromArray
[20:21:46] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[20:22:30] just to be sure, if it’s not too hard to check – there should not be any jobs of the “constraintsTableUpdate” class queued or running (should be hidden behind a feature flag and not enabled yet)
[20:22:53] Lucas_WMDE: thanks for looking
[20:23:15] I don't know how to check job runners
[20:23:22] okay
[20:24:39] I'm looking in kibana maybe the info is in there
[20:25:32] I don't see constraintsTableUpdate in the runjobs dashboard
[20:31:35] interesting, the error rate has fallen quite a bit
[20:32:45] I'm gonna go so far as to say it looks like that error probably was caused by a queued job because the rate tapered off to half of what it was
[20:34:02] !log icinga - re-enabling disabled notifications for IPMI temp checks on
some mc* and mw* hosts where check is fine and OK
[20:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:02] Operations, Wikimedia-Stream: occasional 502 from rcstream seen by pybal - https://phabricator.wikimedia.org/T126313#3372049 (Krinkle) Open>declined Declining per RCStream being deprecated and scheduled to be shut down on July 7th. See also T156919.
[20:41:01] Operations, Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3372065 (mpopov)
[20:41:32] Operations, Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3372065 (mpopov) p:Triage>Normal
[20:48:47] jouncebot: refresh
[20:48:50] I refreshed my knowledge about deployments.
[20:49:11] Hi.
[20:50:36] Dereckson: hi
[20:50:53] train currently blocked by T168681
[20:50:54] T168681: Argument 1 passed to DataValues\UnboundedQuantityValue::newFromArray() must be an instance of array, string given in extensions/Wikidata/vendor/data-values/serialization/src/Deserializers/DataValueDeserializer.php on line 141 - https://phabricator.wikimedia.org/T168681
[20:52:06] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3372118 (awight) FWIW, I see that I have access to tin and can presumably deploy there. However, I don't have ssh access to the canary server...
[20:52:13] added jeroen as it's DataValues
[21:00:04] Dereckson: Respected human, time to deploy Wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T2100). Please do the needful.
[21:00:22] jouncebot: later
[21:03:23] !log restarting rabbitmq-server on labcontrol1001
[21:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:44] (CR) Volans: [C: -1] "There are few improvements that can be done, see inline."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/357723 (https://phabricator.wikimedia.org/T167333) (owner: Herron)
[21:18:27] twentyafterfour: so, you plan to do something now about the train or we can go on deployments?
[21:19:13] Operations, ops-eqiad, Scoring-platform-team, Patch-For-Review: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3372307 (RobH)
[21:28:11] twentyafterfour: hi! I was hoping for some help with a Phabricator bulk operation...
[21:28:40] awight: ok?
[21:28:50] Dereckson: go ahead the train is blocked for now
[21:28:52] I'd like to get myself unsubscribed from https://phabricator.wikimedia.org/project/board/41/query/pDd2e_TTJf0_/ aka. has tag: fundraising-backlog, subscribed: awight
[21:28:53] * Dereckson nods
[21:31:04] awight: https://phabricator.wikimedia.org/maniphest/query/mvvUz9BkFZhZ/#R looks right?
[21:31:39] twentyafterfour: yessir. Slay them aaaaall!
[21:32:11] awight: ok bulk job running
[21:33:08] Operations, DBA, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3372376 (madhuvishy) Hi all, So current status is: - labsdb1001 and 1003: Cloud team needs to announce user maintenance, and handle dns switchover during reboots(I'm not sure what t...
[21:33:14] thanks, you're saving my shredded sanity.
[21:35:25] awight: what about tasks assigned to you?
[21:35:37] I don't think you can be removed as a subscriber when it's assigned to you
[21:35:57] I think dstrine successfully unassigned me from those, just a few minutes ago.
[21:40:56] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[21:41:47] jouncebot: next
[21:41:47] In 1 hour(s) and 18 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T2300)
[21:42:16] (CR) Madhuvishy: [C: 2] tools: fix chattr file path in maintain-kubeusers [puppet] - https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: BryanDavis)
[21:42:25] (PS2) Madhuvishy: tools: fix chattr file path in maintain-kubeusers [puppet] - https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: BryanDavis)
[21:42:46] (CR) Madhuvishy: [C: 2] "Happy to merge this when bryan's around." [puppet] - https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: BryanDavis)
[21:43:02] Operations, Gerrit, Release-Engineering-Team: Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3372915 (demon) I don't need a date, reindexing accounts will take all of thirty seconds.
[21:43:24] !log gerrit: Stopping momentarily, reindexing accounts
[21:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:48] It seems I'm unlucky again :D
[21:46:04] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport]
[21:46:21] Operations, Gerrit, Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3373154 (demon) Open>Resolved a:demon ``` gerrit2@cobalt /var/lib/gerrit2/review_site$ java -jar bin/gerrit.war reindex --index acc...
[21:46:23] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[21:46:24] ^ Will self resolve in a moment
[21:47:00] (PS1) Ottomata: Remove datasets.wm.org class [puppet] - https://gerrit.wikimedia.org/r/360985
[21:47:01] (tbh, we should probably make that a little more resilient. Now that we have a slave we could possibly fall-back)
[21:47:31] (CR) Ottomata: [V: 2 C: 2] Remove datasets.wm.org class [puppet] - https://gerrit.wikimedia.org/r/360985 (owner: Ottomata)
[21:49:43] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[21:53:41] Operations, Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3373452 (RobH) a:RobH>akosiaris All of these systems are now calling into puppet, and are ready for service implementation. The original #hw-request was filed by Alex, so I've assigne...
[21:56:50] Operations, Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3373513 (Ladsgroup) I think that part should be handled in {T168073} but not so sure.
[21:57:52] Operations, Gerrit, Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3373515 (Paladox) @demon i meant a full index, including changes. But i guess that works :). thanks.
[21:59:53] Operations, Gerrit, Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3373520 (demon) Why would the changes need to be reindexed if we're talking about accounts? This whole thing is stupid mess....
[22:00:59] Operations, Gerrit, Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3373523 (demon) Plus, I disagree with the assertion that we didn't do a full reindex. We did. Twice.
[22:02:00] Operations, Gerrit, Release-Engineering-Team (Kanban): Setup maintenance date to reindex gerrit (offline reindex) - https://phabricator.wikimedia.org/T168670#3373525 (Paladox) >>! In T168670#3373523, @demon wrote: > Plus, I disagree with the assertion that we didn't do a full reindex. We did. Twice....
[22:02:33] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.35 seconds
[22:02:59] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3364601 (RobH) >>! In T168442#3370887, @Ladsgroup wrote: > @akosiaris: Hey, @awight is joining Scoring platform team, do you think this needs to...
[22:03:18] Operations, Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3373532 (RobH) a:Dworley>RobH
[22:07:31] (PS2) Dzahn: planet: make config templates and crons for rawdog [puppet] - https://gerrit.wikimedia.org/r/360910 (https://phabricator.wikimedia.org/T168490)
[22:07:33] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3373537 (awight) Open>Resolved a:awight @RobH Thanks, I think I've determined I have appropriate access to deploy, and will deal wi...
[22:10:43] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 8.96 seconds
[22:10:54] (PS1) RobH: Daniel Worley shell access request [puppet] - https://gerrit.wikimedia.org/r/360988
[22:11:57] (CR) Dzahn: [C: 2] planet: make config templates and crons for rawdog [puppet] - https://gerrit.wikimedia.org/r/360910 (https://phabricator.wikimedia.org/T168490) (owner: Dzahn)
[22:14:43] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[22:15:23] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:20:18] Operations, Ops-Access-Requests, Patch-For-Review: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3373553 (RobH) This is now in the 3 day waiting period for objection. If no objections are noted, I'll merge this on Friday.
[22:21:12] Operations, Scoring-platform-team: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3373554 (RobH) More than likely, but I didn't want to assume. If that task does indeed handle it, and everyone involved is aware these servers are ready, this task can be resolved.
[22:26:15] (PS1) Dzahn: planet: fix "splitstate" option, config file name, icon [puppet] - https://gerrit.wikimedia.org/r/360990 (https://phabricator.wikimedia.org/T168490)
[22:26:43] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[22:27:11] (CR) jerkins-bot: [V: -1] planet: fix "splitstate" option, config file name, icon [puppet] - https://gerrit.wikimedia.org/r/360990 (https://phabricator.wikimedia.org/T168490) (owner: Dzahn)
[22:31:37] (PS2) Dzahn: planet: fix "splitstate" option, config file name, icon [puppet] - https://gerrit.wikimedia.org/r/360990 (https://phabricator.wikimedia.org/T168490)
[22:35:08] (CR) Dzahn: [C: 2] planet: fix "splitstate" option, config file name, icon [puppet] - https://gerrit.wikimedia.org/r/360990 (https://phabricator.wikimedia.org/T168490) (owner: Dzahn)
[22:38:40] Operations, Labs, Labs-Infrastructure, cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3373570 (akosiaris) The difficult part will be getting all the old clients (old VMs practically) getting to trust the new p...
[22:42:22] (PS1) Dereckson: Initial configuration for kbp.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/360993 (https://phabricator.wikimedia.org/T160868)
[22:46:35] (CR) Dereckson: [C: 2] Initial configuration for kbp.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/360993 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[22:46:56] (PS1) Andrew Bogott: proxyleaks: Avoid some edge cases that caused occasional script failure [puppet] - https://gerrit.wikimedia.org/r/360994
[22:46:58] (PS1) Andrew Bogott: wmf_sink: Clean up DNS for cleaned up proxies on instance deletion.
[puppet] - https://gerrit.wikimedia.org/r/360995 (https://phabricator.wikimedia.org/T168313)
[22:47:00] (PS1) Andrew Bogott: wmf_sink: Forward some changes to mitaka [puppet] - https://gerrit.wikimedia.org/r/360996
[22:47:57] (CR) jerkins-bot: [V: -1] proxyleaks: Avoid some edge cases that caused occasional script failure [puppet] - https://gerrit.wikimedia.org/r/360994 (owner: Andrew Bogott)
[22:47:59] (Merged) jenkins-bot: Initial configuration for kbp.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/360993 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[22:48:04] (CR) jerkins-bot: [V: -1] wmf_sink: Clean up DNS for cleaned up proxies on instance deletion. [puppet] - https://gerrit.wikimedia.org/r/360995 (https://phabricator.wikimedia.org/T168313) (owner: Andrew Bogott)
[22:48:07] (CR) jenkins-bot: Initial configuration for kbp.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/360993 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[22:48:09] (CR) jerkins-bot: [V: -1] wmf_sink: Forward some changes to mitaka [puppet] - https://gerrit.wikimedia.org/r/360996 (owner: Andrew Bogott)
[22:49:49] (PS2) Andrew Bogott: proxyleaks: Avoid some edge cases that caused occasional script failure [puppet] - https://gerrit.wikimedia.org/r/360994
[22:49:51] (PS2) Andrew Bogott: wmf_sink: Clean up DNS for cleaned up proxies on instance deletion.
[puppet] - https://gerrit.wikimedia.org/r/360995 (https://phabricator.wikimedia.org/T168313)
[22:49:53] (PS2) Andrew Bogott: wmf_sink: Forward some changes to mitaka [puppet] - https://gerrit.wikimedia.org/r/360996
[22:51:12] !log Create tables for kbpwiki (T160868)
[22:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:23] T160868: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868
[22:52:38] !log dereckson@tin Synchronized dblists: (no justification provided) (duration: 00m 48s)
[22:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:17] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: +kbpwiki (T160868)
[22:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:21] !log dereckson@tin Synchronized langlist: +kbp (T160868) (duration: 00m 46s)
[22:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:03] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
[22:56:05] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for kbp.wikipedia (T160868) (duration: 00m 45s)
[22:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:46] (PS1) Dereckson: Add kbp.wikipedia to interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/361001 (https://phabricator.wikimedia.org/T160868)
[22:59:03] (CR) Dereckson: [C: 2] Add kbp.wikipedia to interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/361001 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170622T2300).
Please do the needful.
[23:00:34] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.52 seconds
[23:01:10] (Merged) jenkins-bot: Add kbp.wikipedia to interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/361001 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[23:01:11] If someone needs to SWAT a change, previous deployment is still ending.
[23:01:23] (CR) jenkins-bot: Add kbp.wikipedia to interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/361001 (https://phabricator.wikimedia.org/T160868) (owner: Dereckson)
[23:07:42] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add kbp.wikipedia to interwiki map (T160868) (duration: 00m 47s)
[23:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:51] T160868: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868
[23:11:04] !log dereckson@tin Synchronized wmf-config/interwiki.php: Add kbp.wikipedia to interwiki map (T160868) (duration: 00m 46s)
[23:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:53] !log kbp.wikipedia wiki creation done.
[23:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:47] (PS1) Dzahn: planet: add stylesheet for rawdog planets [puppet] - https://gerrit.wikimedia.org/r/361011
[23:18:42] (CR) jerkins-bot: [V: -1] planet: add stylesheet for rawdog planets [puppet] - https://gerrit.wikimedia.org/r/361011 (owner: Dzahn)
[23:21:57] (PS2) Dzahn: planet: add stylesheet for rawdog planets [puppet] - https://gerrit.wikimedia.org/r/361011
[23:26:55] (PS3) Dzahn: planet: add stylesheet for rawdog planets [puppet] - https://gerrit.wikimedia.org/r/361011
[23:28:13] (CR) Dzahn: [C: 2] planet: add stylesheet for rawdog planets [puppet] - https://gerrit.wikimedia.org/r/361011 (owner: Dzahn)
[23:45:08] Dereckson: what is the name of Kabiye in Kabiye? also Kabiye?
[23:45:22] i see it has about 10 different spellings anyways
[23:45:46] ah, found it "Kabɩyɛ"
[23:56:08] Operations, MobileFrontend, Reading-Web-Backlog, Traffic, and 3 others: Remove disableImages handling from VCL - https://phabricator.wikimedia.org/T168013#3352978 (Jdlrobson) Train blocked by T168681