[00:00:05] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:02:58] 10Operations, 10cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3728379 (10bd808) [00:03:03] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3728377 (10bd808) 05Open>03Resolved >>! In T168584#3725668, @MoritzMuehlenhoff wrote: > Let's just keep 1003 running w/o r... [00:03:49] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:04:29] PROBLEM - Druid historical on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical [00:07:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [00:17:18] ACKNOWLEDGEMENT - Host cp4009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black decomming [00:17:19] ACKNOWLEDGEMENT - Host cp4010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black decomming [00:27:38] RECOVERY - Druid historical on druid1003 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical [00:27:58] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational [00:28:35] !log lvs1001-3: re-enable ethernet flowcontrol (short ethernet blip), with pybal stopped to failover to backups during... 
[00:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:00] !log cp4022 - varnish backend restart, mailbox [00:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:25] !log ebernhardson@tin Synchronized php-1.31.0-wmf.6/extensions/WikimediaEvents/extension.json: (no justification provided) (duration: 00m 50s) [00:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:50] !log ebernhardson@tin Synchronized php-1.31.0-wmf.5/extensions/WikimediaEvents/extension.json: SWAT followup: Update SearchSatisfaction eventlogging schema revision id (duration: 00m 50s) [00:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:19] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509584656 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3813364 keys, up 4 minutes 13 seconds - replication_delay is 1509584656 [01:04:28] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1509584660 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8516001 keys, up 4 minutes 18 seconds - replication_delay is 1509584660 [01:04:28] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:04:28] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1509584660 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8520694 keys, up 4 minutes 18 seconds - replication_delay is 1509584660 [01:04:28] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509584661 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3815547 keys, up 4 minutes 17 seconds - replication_delay is 1509584661 [01:04:28] PROBLEM - Check health of redis 
instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1509584661 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3810793 keys, up 4 minutes 17 seconds - replication_delay is 1509584661 [01:04:28] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [01:05:28] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8511173 keys, up 5 minutes 22 seconds - replication_delay is 0 [01:05:29] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8406766 keys, up 5 minutes 21 seconds - replication_delay is 0 [01:06:19] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3805224 keys, up 6 minutes 14 seconds - replication_delay is 0 [01:06:28] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8509901 keys, up 6 minutes 19 seconds - replication_delay is 0 [01:06:28] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8513523 keys, up 6 minutes 20 seconds - replication_delay is 0 [01:07:28] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3806994 keys, up 7 minutes 18 seconds - replication_delay is 0 [01:07:28] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3803084 keys, up 7 minutes 18 seconds - replication_delay is 0 [02:43:24] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.5) (duration: 16m 06s) [02:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:23] !log 
l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.6) (duration: 15m 37s) [03:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:28] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 693.58 seconds [03:29:36] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 2 03:29:35 UTC 2017 (duration 7m 12s) [03:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:29] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [03:38:28] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [03:46:35] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3728526 (10Anomie) >>! In T178538#3727988, @Krinkle wrote: > * Verify the hypothesis. Ensure MediaWiki works and tests pass under the proposed HHVM version and configur... 
[04:05:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 236.97 seconds [04:12:45] 10Operations, 10Deployments, 10Performance-Team, 10Traffic, and 2 others: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#3728547 (10Krinkle) [04:13:03] 10Operations, 10Performance-Team, 10Traffic, 10Performance-Team-notice: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#3728550 (10Krinkle) [04:13:30] 10Operations, 10MediaWiki-General-or-Unknown, 10Performance-Team, 10Performance-Team-notice, 10Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3728553 (10Krinkle) [04:18:33] > [04:18:35] Your membership in the mailing list Mentors has been disabled due to [04:18:35] excessive bounces The last bounce received from you was dated [04:18:35] 02-Nov-2017. [04:18:36] > [04:18:44] is there a problem with @wikimedia.org addresses bouncing? 
[04:22:11] oh, I bet I know why [04:22:51] there's a bunch of 100% spam junk messages in the mailman archives for this list, I'm guessing google apps bounced them back causing me to pass the threshold [04:23:00] if that's true I bet a bunch of people got unsubscribed :( [05:23:28] (03PS3) 10Tim Starling: Update dumps archive_index.html for the files I just uploaded [puppet] - 10https://gerrit.wikimedia.org/r/383958 [05:23:35] (03CR) 10Tim Starling: [C: 032] Update dumps archive_index.html for the files I just uploaded [puppet] - 10https://gerrit.wikimedia.org/r/383958 (owner: 10Tim Starling) [05:26:46] (03PS1) 10Legoktm: dumps: Fix typo in archive_index.html [puppet] - 10https://gerrit.wikimedia.org/r/387977 [05:27:22] TimStarling: ^ noticed when looking at your commit [05:29:02] (03CR) 10Tim Starling: [C: 032] dumps: Fix typo in archive_index.html [puppet] - 10https://gerrit.wikimedia.org/r/387977 (owner: 10Legoktm) [05:29:09] (03PS2) 10Tim Starling: dumps: Fix typo in archive_index.html [puppet] - 10https://gerrit.wikimedia.org/r/387977 (owner: 10Legoktm) [05:32:26] error: file write error (No space left on device) [05:43:39] !log on puppetmaster2001: disk full due to logs, so I'm truncating /var/log/debug which is apparently copied to daemon.log anyway [05:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:53:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:58:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:59:58] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:01:57] (03PS1) 10Legoktm: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/387978 [06:02:44] !log live-hacking mwdebug1002 [06:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:29] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3728618 (10Dzahn) [06:24:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387979 (https://phabricator.wikimedia.org/T161088) [06:27:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387979 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:27:29] !log Deploy optimize table wikidatawiki.pagelinks on s5 master (db1063) - T174509 [06:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:36] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [06:28:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387979 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:28:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387979 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [06:29:38] (03PS1) 10Krinkle: grafana: Update server-board JSON to schemaVersion 14 (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/387980 [06:29:40] (03PS1) 10Krinkle: grafana: Reduce "network" dropdown to valid options for current server [puppet] - 10https://gerrit.wikimedia.org/r/387981 [06:30:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 - T161088 (duration: 01m 07s) [06:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:33] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 
[06:31:23] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3728625 (10Legoktm) {F10564122} So I got it to eventually work on testwiki + mwdebug1002 - it's a caching problem. I think it should b... [06:31:50] !log Stop MySQL on db1091 and db1103 to migrate db1091 to file per table [06:31:52] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3728626 (10Legoktm) Also, if you want to view a diff uncached just use show changes in action=edit or click "undo" on an edit. [06:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:44] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3728627 (10Dzahn) How to add a contact in Icinga requirements: shell access to puppetmaster, root privileges - ssh to production puppet master (s... [06:33:06] !log done on mwdebug1002 [06:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:27] (03CR) 10Krinkle: "Preview at https://grafana-admin.wikimedia.org/dashboard/db/server-board" [puppet] - 10https://gerrit.wikimedia.org/r/387981 (owner: 10Krinkle) [06:33:55] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3728628 (10Dzahn) Whether the email address used is an individual, a public list, a private list, mailman or not mailman etc is irrelevant for the... [06:33:58] (03CR) 10Legoktm: [C: 031] "This should be OK to go out, use action=edit to test whether it's working instead of checking potentially cached diffs." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (owner: 10Legoktm) [06:34:30] (03PS2) 10Legoktm: Revert "Revert "Set wgWikiDiff2MovedParagraphDetectionCutoff for group0"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) [06:39:41] (03PS1) 10Dzahn: icinga/nagios_common: add contact team-reading-web to contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/387983 (https://phabricator.wikimedia.org/T172879) [06:41:39] (03PS1) 10Legoktm: docker: Add deb-src for apt.wm.o in jessie and stretch images [puppet] - 10https://gerrit.wikimedia.org/r/387984 (https://phabricator.wikimedia.org/T179354) [06:41:49] (03PS2) 10Dzahn: icinga/nagios_common: add contact team-reading-web to contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/387983 (https://phabricator.wikimedia.org/T172879) [06:41:54] (03CR) 10Dzahn: [C: 032] icinga/nagios_common: add contact team-reading-web to contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/387983 (https://phabricator.wikimedia.org/T172879) (owner: 10Dzahn) [06:50:09] (03PS1) 10Marostegui: mariadb: Add db2085 to s3 and s5 [puppet] - 10https://gerrit.wikimedia.org/r/387985 (https://phabricator.wikimedia.org/T178359) [06:53:07] (03CR) 10Marostegui: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/8595/" [puppet] - 10https://gerrit.wikimedia.org/r/387985 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:54:37] !log Stop s3 instance on db2092 to copy it to db2085 - T178359 [06:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:43] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:55:26] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [06:59:03] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Patch-For-Review, 10Spike: How do you add a private mailing list as an Icinga contact? 
- https://phabricator.wikimedia.org/T172879#3728656 (10Dzahn) [06:59:34] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Patch-For-Review, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3511962 (10Dzahn) 05Open>03Resolved https://wikitech.wikimedia.org/wiki/Icinga#Adding_a_new_contact also see: T164238 [07:00:57] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Patch-For-Review, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3728663 (10Dzahn) on einsteinium (icinga prod server): ``` define contactgroup { contactgroup_name team-readi... [07:02:51] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3728666 (10Dzahn) 05Open>03stalled blocked on T179302 [07:03:12] 10Operations, 10Mail, 10monitoring: prometheus metrics and grafana dashboard for exim - https://phabricator.wikimedia.org/T179302#3719846 (10Dzahn) [07:03:15] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3728670 (10Dzahn) [07:05:05] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292437 (10Dzahn) p:05Normal>03Low [07:05:37] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3728675 (10Marostegui) Hi, I have managed to bring MySQL up on a read-only state (with innodb_force_recovery=2). Now I can select stuff and see some data. If you... 
[07:46:13] 10Operations, 10Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3728699 (10Dzahn) Looks like the site is live now: https://wikimediaendowment.org/ [07:46:48] 10Operations, 10Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3728700 (10Dzahn) 05stalled>03Resolved @kaythaney This is all resolved now, right? [07:53:40] (03CR) 10Dzahn: "@gehel but this is adding -Xms.. it's "revert revert"." [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [07:54:19] (03PS3) 10Dzahn: Revert "Revert "Gerrit: Also set minimum heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [07:59:13] !log Stop mysql event scheduler on labsdb1001 - T179464 [07:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:20] T179464: labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464 [08:04:16] !log installing curl security updates [08:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:18] (03CR) 10Dzahn: [C: 032] Revert "Revert "Gerrit: Also set minimum heap size"" [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [08:11:22] (03PS15) 10Dzahn: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [08:12:15] (03CR) 10Dzahn: [C: 032] Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [08:14:30] (03PS2) 10Dzahn: Git::clone tidy up default gerrit URLs [puppet] - 10https://gerrit.wikimedia.org/r/384756 (owner: 10Chad) [08:15:06] (03PS1) 10Phedenskog: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 (https://phabricator.wikimedia.org/T179555) [08:15:18] !log restarting app server canaries to pick up curl security update [08:15:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:44] (03PS2) 10Phedenskog: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 (https://phabricator.wikimedia.org/T179555) [08:15:56] (03CR) 10Dzahn: [C: 032] Git::clone tidy up default gerrit URLs [puppet] - 10https://gerrit.wikimedia.org/r/384756 (owner: 10Chad) [08:22:56] (03PS1) 10Marostegui: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387998 (https://phabricator.wikimedia.org/T161088) [08:25:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387998 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:25:49] (03CR) 10Gehel: "@Dzahn: right, I was confused by the double negation... In that case, that CR was probably mostly useless..." [puppet] - 10https://gerrit.wikimedia.org/r/387605 (owner: 10Chad) [08:27:12] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387998 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:27:21] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1091 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387998 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:27:45] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[libcurl3-dbg] [08:28:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 with low weight after maintenance - T161088 (duration: 00m 50s) [08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:37] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [08:29:44] (03PS2) 10Dzahn: Do clones of MediaWiki + extensions + skins + vendor to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/384754 (owner: 10Chad) [08:30:10] (03PS3) 10Dzahn: releases: Clone MediaWiki + extensions + skins + vendor to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/384754 (owner: 10Chad) [08:30:25] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[libcurl3-dbg] [08:31:16] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2082871 [08:35:11] (03PS2) 10Giuseppe Lavagetto: Use python-build images for build [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/387587 [08:35:28] (03CR) 10Dzahn: [C: 032] releases: Clone MediaWiki + extensions + skins + vendor to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/384754 (owner: 10Chad) [08:39:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Use python-build images for build [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/387587 (owner: 10Giuseppe Lavagetto) [08:42:44] (03PS1) 10Marostegui: db-eqiad.php: Increase db1091 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388000 (https://phabricator.wikimedia.org/T161088) [08:42:53] (03PS9) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [08:44:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1091 weight 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/388000 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:45:32] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388000 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:46:26] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1091 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388000 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [08:46:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1091 weight after maintenance - T161088 (duration: 00m 50s) [08:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:47] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [08:52:35] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_clone_mediawiki/extensions],Exec[git_clone_mediawiki/skins] [08:53:00] grrr ^ cloning timed out [08:53:21] does the initial clone manually [08:53:24] 10Operations, 10Release-Engineering-Team, 10User-Joe: Create jenkins job for creating deployment artifacts for `docker-pkg-deploy` - https://phabricator.wikimedia.org/T179562#3728803 (10Joe) [08:53:33] 10Operations, 10Release-Engineering-Team, 10User-Joe: Create jenkins job for creating deployment artifacts for `docker-pkg-deploy` - https://phabricator.wikimedia.org/T179562#3728815 (10Joe) p:05Triage>03High [08:54:30] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388001 (https://phabricator.wikimedia.org/T161088) [08:55:25] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [08:55:45] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 
failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_clone_mediawiki/extensions],Exec[git_clone_mediawiki/skins] [08:57:35] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:57:39] (03PS3) 10Giuseppe Lavagetto: contint: include python3 on CI masters [puppet] - 10https://gerrit.wikimedia.org/r/385210 (https://phabricator.wikimedia.org/T178594) (owner: 10Hashar) [08:57:45] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:59:56] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: include python3 on CI masters [puppet] - 10https://gerrit.wikimedia.org/r/385210 (https://phabricator.wikimedia.org/T178594) (owner: 10Hashar) [09:00:45] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:02:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Right concept, and the implementation would work, but we should not add an exec where we already wrote a custom resource." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [09:04:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388001 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:05:25] (03PS1) 10Giuseppe Lavagetto: Re-introduce the CI servers as targets [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/388002 [09:05:27] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388001 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:06:26] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388001 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:06:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1091 weight after maintenance - T161088 (duration: 00m 50s) [09:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:32] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [09:09:40] (03PS2) 10Ema: VCL: add layer information to X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) [09:10:12] (03CR) 10Marostegui: "> Is anything else needed to deploy this change?" 
[puppet] - 10https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) (owner: 10Jcrespo) [09:18:38] (03PS1) 10Marostegui: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388004 (https://phabricator.wikimedia.org/T161088) [09:24:50] 10Operations, 10Goal, 10User-fgiunchedi: Port exim statistics to Prometheus - https://phabricator.wikimedia.org/T179565#3728881 (10fgiunchedi) [09:28:06] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3728894 (10Dzahn) @Framawiki Any progress on this? [09:28:37] 10Operations, 10ops-eqiad, 10DBA: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3728895 (10Marostegui) a:03Cmjohnson [09:35:15] 10Operations, 10Wikimedia-General-or-Unknown, 10Tor, 10WorkType-NewFunctionality: Run our own Tor client for Tor block - https://phabricator.wikimedia.org/T32716#355684 (10Dzahn) Given that the last 2 comments seem to agree on calling this resolved.. and then more than 2 years went by without further comme... 
[09:39:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388004 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:40:36] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388004 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:40:45] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388004 (https://phabricator.wikimedia.org/T161088) (owner: 10Marostegui) [09:41:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1091 original weight - T161088 (duration: 00m 50s) [09:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:56] T161088: Migrate some s4 hosts to file per table - https://phabricator.wikimedia.org/T161088 [09:42:19] (03PS10) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [09:43:12] (03CR) 10Muehlenhoff: [C: 032] Use new repository layout for stretch onwards (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [09:44:27] (03PS2) 10Filippo Giunchedi: prometheus: add k8s instance [puppet] - 10https://gerrit.wikimedia.org/r/387551 (https://phabricator.wikimedia.org/T177395) [09:45:10] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add k8s instance [puppet] - 10https://gerrit.wikimedia.org/r/387551 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [09:48:18] (03CR) 10Hashar: "I have purged blubber from the CI labs instances:" [puppet] - 10https://gerrit.wikimedia.org/r/385208 (owner: 10Hashar) [09:53:32] (03PS3) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 
(https://phabricator.wikimedia.org/T158583) [09:54:10] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] prometheus: add k8s instance [puppet] - 10https://gerrit.wikimedia.org/r/387551 (https://phabricator.wikimedia.org/T177395) (owner: 10Filippo Giunchedi) [09:54:22] 10Operations, 10Icinga, 10Readers-Web-Backlog, 10Patch-For-Review, 10Spike: How do you add a private mailing list as an Icinga contact? - https://phabricator.wikimedia.org/T172879#3511962 (10phuedx) Wow! Thanks for the detailed explanation and for the write-up on wiki, @Dzahn! [09:56:36] (03CR) 10Addshore: "Perhaps we should look at altering the cache key first?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387978 (https://phabricator.wikimedia.org/T177891) (owner: 10Legoktm) [09:57:32] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3728971 (10Addshore) > Also, if you want to view a diff uncached just use show changes in action=edit or click "undo" on an edit. I d... 
[10:13:32] !log Deploy alter table on db2019 - T174569 [10:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:39] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [10:20:40] (03CR) 10Hashar: "check experimental" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/369838 (https://phabricator.wikimedia.org/T172358) (owner: 10BryanDavis) [10:20:53] (03CR) 10Hashar: "check experimental" [dumps] - 10https://gerrit.wikimedia.org/r/387593 (owner: 10ArielGlenn) [10:21:08] (03CR) 10Hashar: "check experimental" [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350282 (owner: 10ArielGlenn) [10:21:16] (03CR) 10jenkins-bot: Install composer for PHP images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/369838 (https://phabricator.wikimedia.org/T172358) (owner: 10BryanDavis) [10:21:23] (03CR) 10jenkins-bot: tabs to spaces, blame mutante. [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350282 (owner: 10ArielGlenn) [10:21:31] (03CR) 10Hashar: "check experimental" [software] - 10https://gerrit.wikimedia.org/r/387526 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:21:41] (03CR) 10jenkins-bot: drop check of existence for config file from all utils [dumps] - 10https://gerrit.wikimedia.org/r/387593 (owner: 10ArielGlenn) [10:23:36] (03CR) 10jenkins-bot: s5,s6.hosts: Add db2089 [software] - 10https://gerrit.wikimedia.org/r/387526 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:32:19] (03PS1) 10Hashar: Do not use bare except [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/388016 [10:33:05] (03CR) 10Hashar: "check experimental" [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/388016 (owner: 10Hashar) [10:34:31] apergos: hello! 
I got some flake8 fix for dumps/import-tools https://gerrit.wikimedia.org/r/#/c/388016/ : ) [10:34:38] bah [10:34:43] poor timeout [10:35:30] (03PS1) 10Gehel: wdqs: set G1 new size to 20% [puppet] - 10https://gerrit.wikimedia.org/r/388017 (https://phabricator.wikimedia.org/T175919) [10:36:11] (03CR) 10Gehel: [C: 032] wdqs: set G1 new size to 20% [puppet] - 10https://gerrit.wikimedia.org/r/388017 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [10:37:04] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 (owner: 10Giuseppe Lavagetto) [10:37:30] (03CR) 10Hashar: "check experimental" [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) (owner: 10Volans) [10:37:46] (03CR) 10Hashar: "check experimental" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/387547 (owner: 10Giuseppe Lavagetto) [10:37:56] (03CR) 10jenkins-bot: Add pypi classifiers [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 (owner: 10Giuseppe Lavagetto) [10:38:05] hashar: what's this, new trick? :-P [10:38:11] xddd [10:38:21] it is a developer thing [10:38:28] it is too complicated for you ops :] [10:38:39] !log rolling restart of wdqs nodes for GC tuning - T175919 [10:38:43] I'm listed as developer in the org chart :-P [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:44] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [10:38:46] more seriously, Zuul process events from Gerrit such as someone adding a comment in Gerrit [10:38:53] (03CR) 10jenkins-bot: Do not add twice the registry name to images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/387547 (owner: 10Giuseppe Lavagetto) [10:39:04] Zuul then pass those events to "pipelines" which filter events [10:39:21] <_joe_> wat? 
[10:39:21] if one comments "check experimental", the event will be accepted by the "experimental" pipeline which then triggers a set of jobs [10:39:39] volans: so really in the end, it is just a way to run jobs that are not the same as those run when one sends a patchset in gerrit [10:39:41] <_joe_> hashar: why did jenkins-bot make that CR? that patch is merged since forever [10:40:03] _joe_: cause I asked it to test the patch by commenting "check experimental" ex: https://gerrit.wikimedia.org/r/#/c/387547/ [10:40:09] <_joe_> ah ok [10:40:12] and jenkins-bot kindly replied that the job went fine [10:40:19] <_joe_> ahah ok [10:40:20] here I am testing the tox jobs that run in docker [10:40:22] ;D [10:40:37] (03CR) 10jenkins-bot: PuppetDB backend: Class, Roles and Profiles shortcuts [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) (owner: 10Volans) [10:40:38] so eventually [10:40:46] if the job passes, I can switch CI to run the docker flavor [10:40:53] then deploy and nothing bad will happen :D [10:41:15] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2052928 [10:41:19] mine timed out :D [10:42:06] yeah some of the instances are too slow: (- [10:43:04] (03PS2) 10Alexandros Kosiaris: grafana: Update server-board JSON to schemaVersion 14 (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/387980 (owner: 10Krinkle) [10:43:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana: Update server-board JSON to schemaVersion 14 (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/387980 (owner: 10Krinkle) [10:44:57] (03CR) 10Hashar: "check experimental" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342445 (owner: 10Giuseppe Lavagetto) [10:45:20] there will be a bit of spam, I am testing all the operations/* repositories having tox [10:46:02] (03CR) 10jenkins-bot: Refactor ReplicationController, version bump [software/etcd-mirror] - 
10https://gerrit.wikimedia.org/r/342445 (owner: 10Giuseppe Lavagetto) [10:46:34] (03CR) 10Hashar: "check experimental" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/325042 (owner: 10Gerrit Patch Uploader) [10:46:54] (03CR) 10Hashar: "check experimental" [software/service-checker] - 10https://gerrit.wikimedia.org/r/358037 (owner: 10Giuseppe Lavagetto) [10:47:16] (03CR) 10Hashar: "check experimental" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [10:47:24] * volans hopes not manually... [10:47:28] <_joe_> hashar: about that, I'm about to post a patch to update profile::ci:shipyard [10:47:31] (03CR) 10jenkins-bot: Fix up puppet-compiler for labs usage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/325042 (owner: 10Gerrit Patch Uploader) [10:47:37] (03CR) 10Hashar: "check experimental" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/383877 (owner: 10BryanDavis) [10:47:45] (03CR) 10jenkins-bot: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [10:47:53] (03CR) 10jenkins-bot: Use assert_hostname for https urls only [software/service-checker] - 10https://gerrit.wikimedia.org/r/358037 (owner: 10Giuseppe Lavagetto) [10:47:57] (03CR) 10Hashar: "check experimental" [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/385155 (owner: 10Filippo Giunchedi) [10:48:09] (03CR) 10Dzahn: "meanwhile i added the missing parameters in another change, needs manual rebase..amending" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [10:48:13] (03CR) 10Hashar: "check experimental" [switchdc] - 10https://gerrit.wikimedia.org/r/351680 (https://phabricator.wikimedia.org/T164403) (owner: 10Volans) [10:48:37] (03CR) 10jenkins-bot: t09_start_maintenance: clear systemctl state on dc_from [switchdc] - 
10https://gerrit.wikimedia.org/r/351680 (https://phabricator.wikimedia.org/T164403) (owner: 10Volans) [10:48:49] (03CR) 10jenkins-bot: Release 0.4 [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/385155 (owner: 10Filippo Giunchedi) [10:48:51] (03CR) 10jenkins-bot: Bump debian package version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/383877 (owner: 10BryanDavis) [10:53:31] !log depooling logstash100[123] in preparation for decommission - T175830 [10:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:38] T175830: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830 [10:53:40] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=logstash1001.eqiad.wmnet [10:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:58] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=logstash1002.eqiad.wmnet [10:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:09] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=logstash1003.eqiad.wmnet [10:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:27] (03PS4) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) [10:58:03] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3729031 (10Gehel) [10:59:01] (03PS1) 10Elukey: Enable icinga notifications for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/388021 (https://phabricator.wikimedia.org/T177405) [10:59:38] (03CR) 10Elukey: [C: 032] Enable icinga notifications for db1108 [puppet] - 10https://gerrit.wikimedia.org/r/388021 (https://phabricator.wikimedia.org/T177405) 
(owner: 10Elukey) [11:00:54] (03CR) 10Muehlenhoff: [C: 032] Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [11:00:59] (03PS5) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) [11:01:27] (03CR) 10Muehlenhoff: [V: 032 C: 032] Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [11:03:15] (03CR) 10Dzahn: gerrit: let Apache proxy only listen on service IP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:04:06] (03PS1) 10Muehlenhoff: Remove thirdparty component for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/388022 [11:04:46] (03PS13) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [11:05:29] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:08:32] (03PS14) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [11:09:14] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:10:10] doesn't see the actual error [11:11:33] sees h.ashar migrating tox jobs and shuts up :) [11:12:05] most tox jobs are on docker now :] [11:12:21] lot of them fails due to flake8 having released a new version. I should probably drop a mail about it [11:12:22] :) [11:12:31] it is food time! 
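[editor's note] The flow hashar described at 10:38-10:39 (Zuul receives events from Gerrit, hands them to pipelines that filter them, and a "check experimental" comment is accepted by the "experimental" pipeline, which then triggers jobs) can be sketched roughly as below. This is only an illustration under stated assumptions, not Zuul's actual code or configuration format: the `Event` and `Pipeline` classes, the `dispatch` function, the "test" pipeline, and the job names `tox` and `tox-docker` are all hypothetical, and the event kind strings are used here purely as labels.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """A simplified Gerrit event (hypothetical shape, for illustration only)."""
    kind: str            # label for the event type, e.g. "comment-added"
    comment: str = ""    # text of the review comment, if any


@dataclass
class Pipeline:
    """A pipeline accepts some events and maps them to a set of jobs."""
    name: str
    event_kind: str
    comment_pattern: Optional[str] = None   # None means "any comment"
    jobs: List[str] = field(default_factory=list)

    def accepts(self, event: Event) -> bool:
        # Filter first on the event type, then on the comment text.
        if event.kind != self.event_kind:
            return False
        if self.comment_pattern is None:
            return True
        return re.fullmatch(self.comment_pattern, event.comment.strip()) is not None


# Hypothetical pipelines: one for new patchsets, one for the magic comment.
PIPELINES = [
    Pipeline("test", "patchset-created", jobs=["tox"]),
    Pipeline("experimental", "comment-added", r"check experimental", ["tox-docker"]),
]


def dispatch(event: Event, pipelines: List[Pipeline] = PIPELINES) -> List[Tuple[str, List[str]]]:
    """Return (pipeline name, jobs) for every pipeline that accepts the event."""
    return [(p.name, p.jobs) for p in pipelines if p.accepts(event)]


if __name__ == "__main__":
    print(dispatch(Event("comment-added", "check experimental")))
    print(dispatch(Event("comment-added", "LGTM")))
```

In this sketch an ordinary review comment matches no pipeline, which mirrors the point made in the channel: commenting "check experimental" is just a way to run a different set of jobs from the ones run when a patchset is uploaded.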
[11:12:35] indeed [11:13:03] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:13:32] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:16:17] (03PS1) 10Gehel: wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175830) [11:16:48] (03CR) 10jerkins-bot: [V: 04-1] wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175830) (owner: 10Gehel) [11:17:41] (03PS2) 10Gehel: wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175830) [11:19:49] (03PS1) 10Muehlenhoff: Remove experimental component [puppet] - 10https://gerrit.wikimedia.org/r/388027 [11:21:05] !log mobrovac@tin Started deploy [restbase/deploy@f6c4e2d]: Parsoid module: use the Cassandra 2 tables as fallback when needed - T179417 [11:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:11] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [11:23:57] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3729078 (10Gehel) [11:26:09] !log drop log.MediaViewer_10867062_15423246 from db1047,db1108 since already archived in hdfs - T168303 [11:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:16] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [11:26:21] marostegui: I am sorry --^ :( [11:26:32] a lot of pain for a table that we don't need [11:26:34] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the 
Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3729090 (10Gehel) [11:26:37] hahaha [11:26:49] well, at least it will be gone once we start importing db1107 [11:27:44] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1540070 (10Gehel) [11:27:47] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Investigate mysterious write load during general read-only maintenance - https://phabricator.wikimedia.org/T109127#3729096 (10Gehel) 05Open>03Resolved a:03Gehel Too many things have changed and too many logs have been rotated since this... [11:28:39] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#3729108 (10Gehel) [11:28:40] (03PS2) 10KartikMistry: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 [11:28:41] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: Collect threaddumps from elasticsearch at regular intervals - https://phabricator.wikimedia.org/T130209#3729105 (10Gehel) 05Open>03Resolved a:03Gehel We have not had a real use case for this for the last year. Closing this, we... 
[11:29:21] !log mobrovac@tin Finished deploy [restbase/deploy@f6c4e2d]: Parsoid module: use the Cassandra 2 tables as fallback when needed - T179417 (duration: 08m 15s) [11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:27] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [11:29:39] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1540070 (10Gehel) @debt: I finally went through all the subtasks and closed a bunch of them. There are st... [11:31:26] !log remove obsolete packages from stretch-wikimedia/experimental, now empty [11:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:16] (03PS2) 10Addshore: Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381193 [11:40:45] (03PS5) 10Addshore: Add loading of wikibase extensions from build [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) [11:41:03] (03CR) 10Addshore: [C: 04-1] "should look at and act on audes comments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: 10Addshore) [11:41:13] (03PS5) 10Addshore: Load wikibase build from mediawiki-config for beta wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381195 (https://phabricator.wikimedia.org/T176948) [11:41:17] (03PS4) 10Addshore: Load wikibase build from mediawiki-config for beta wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381199 (https://phabricator.wikimedia.org/T176948) [11:43:39] !log drop table log.PageContentSaveComplete_5588433 from db1046,db1047,db1108,dbstore1002, archived on hdfs - T177101 [11:43:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:46] T177101: PageContentSaveComplete. Stop collecting - https://phabricator.wikimedia.org/T177101 [11:45:05] (03PS1) 10Giuseppe Lavagetto: profile::ci::shipyard: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) [11:46:13] (03PS1) 10Filippo Giunchedi: mx: export metrics from exim4 mainlog [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) [11:47:20] 10Operations, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#3729163 (10Dzahn) @Seb35 @Peachey88 @Herron since T168467 is resolved meanwhile, does this mean this ticket can be closed as... [11:51:24] (03PS1) 10Giuseppe Lavagetto: Add fake docker credentials to hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/388034 [11:53:13] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add fake docker credentials to hieradata [labs/private] - 10https://gerrit.wikimedia.org/r/388034 (owner: 10Giuseppe Lavagetto) [11:54:24] (03PS15) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [11:54:55] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [11:57:21] (03PS1) 10Ppchelko: Run the updateBetaFeatureUserCount on kafka queue only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 [11:57:57] !log mobrovac@tin Started deploy [restbase/deploy@9314cf6]: Parsoid: Switch all but WPs to use the next-generation storage - T179417 [11:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:03] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [12:02:55] (03CR) 10Filippo Giunchedi: "PCC 
https://puppet-compiler.wmflabs.org/compiler02/8597/" [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [12:03:41] (03CR) 10Filippo Giunchedi: "This works for, though I'll need to followup with a simple way to provide sample log files and associated tests to check the program works" [puppet] - 10https://gerrit.wikimedia.org/r/388032 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [12:05:02] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Re-introduce the CI servers as targets [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/388002 (owner: 10Giuseppe Lavagetto) [12:05:20] (03PS2) 10Giuseppe Lavagetto: profile::ci::shipyard: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) [12:06:32] (03PS2) 10Ppchelko: Run the updateBetaFeatureUserCount on kafka queue only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 (https://phabricator.wikimedia.org/T175210) [12:06:39] !log mobrovac@tin Finished deploy [restbase/deploy@9314cf6]: Parsoid: Switch all but WPs to use the next-generation storage - T179417 (duration: 08m 43s) [12:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:49] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [12:08:22] (03PS16) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [12:09:58] !log removed obsolete packages from jessie-wikimedia/experimental [12:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:36] (03CR) 10Mobrovac: [C: 031] "LGTM, let's put it up for SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 (https://phabricator.wikimedia.org/T175210) (owner: 10Ppchelko) [12:11:45] (03PS3) 10Giuseppe Lavagetto: profile::ci::shipyard: add docker-pkg [puppet] - 
10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) [12:12:24] (03PS17) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [12:14:17] (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Reduce "network" dropdown to valid options for current server [puppet] - 10https://gerrit.wikimedia.org/r/387981 (owner: 10Krinkle) [12:14:21] (03PS2) 10Alexandros Kosiaris: grafana: Reduce "network" dropdown to valid options for current server [puppet] - 10https://gerrit.wikimedia.org/r/387981 (owner: 10Krinkle) [12:14:29] (03CR) 10Alexandros Kosiaris: [C: 032] "Nice! this is useful, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/387981 (owner: 10Krinkle) [12:14:32] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana: Reduce "network" dropdown to valid options for current server [puppet] - 10https://gerrit.wikimedia.org/r/387981 (owner: 10Krinkle) [12:15:28] (03PS18) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [12:17:34] PROBLEM - Check Varnish expiry mailbox lag on cp4023 is CRITICAL: CRITICAL: expiry mailbox lag is 2048937 [12:22:17] (03CR) 10ArielGlenn: [C: 032] Do not use bare except [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/388016 (owner: 10Hashar) [12:30:53] (03CR) 10Muehlenhoff: [C: 032] Remove thirdparty component for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/388022 (owner: 10Muehlenhoff) [12:30:58] (03PS2) 10Muehlenhoff: Remove thirdparty component for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/388022 [12:32:28] 10Operations, 10monitoring: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#3729296 (10Volans) As partially discussed in the last monitoring meeting, this is one possibility: - Given that we're going to have a single Puppet role per host, it seems to me that the co... 
[12:34:02] (03PS2) 10Muehlenhoff: Remove experimental component [puppet] - 10https://gerrit.wikimedia.org/r/388027 [12:37:44] (03CR) 10Dzahn: "compiles now: http://puppet-compiler.wmflabs.org/8605/" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [12:42:35] (03CR) 10Dzahn: [C: 031] "i think it should work now" [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [12:46:12] (03PS2) 10Dzahn: requesttracker: Switch ferm rule to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366812 (owner: 10Muehlenhoff) [12:46:19] (03PS3) 10Dzahn: requesttracker: Switch ferm rule to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366812 (owner: 10Muehlenhoff) [12:46:30] 10Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3729337 (10elukey) All right we are ready to outline the next steps with some deadlines (tentative): * November 6th: the analytics-slave CNAME moves from db1047 to db1108 * N... [12:47:31] (03CR) 10Dzahn: [C: 032] "thanks bblack! verifying it on ununpentium" [puppet] - 10https://gerrit.wikimedia.org/r/366812 (owner: 10Muehlenhoff) [12:48:35] (03CR) 10Muehlenhoff: [C: 032] Remove experimental component [puppet] - 10https://gerrit.wikimedia.org/r/388027 (owner: 10Muehlenhoff) [12:48:40] (03PS3) 10Muehlenhoff: Remove experimental component [puppet] - 10https://gerrit.wikimedia.org/r/388027 [12:49:45] (03CR) 10Dzahn: "applied on ununpentium. no issues. 
https://rt.wikimedia.org still works" [puppet] - 10https://gerrit.wikimedia.org/r/366812 (owner: 10Muehlenhoff) [12:55:09] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3729344 (10elukey) [12:56:24] (03PS2) 10Dzahn: profile::microsites::static_bugzilla: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366819 (owner: 10Muehlenhoff) [13:00:00] (03CR) 10Dzahn: [C: 032] profile::microsites::static_bugzilla: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366819 (owner: 10Muehlenhoff) [13:00:10] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T1300). [13:00:11] kart_ and gehel: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] * kart_ is here [13:00:17] I can SWAT today [13:00:18] jouncebot: o/ [13:00:59] bothumor is getting better... :D [13:01:48] kart_, gehel: I can SWAT, so not trying to make you do it, just asking if you would prefer to deploy your patches yourselves :) [13:02:09] zeljkof: I much prefer if you do it! [13:02:28] (03CR) 10Dzahn: "applied on bromine.eqiad.wmnet, this affected:" [puppet] - 10https://gerrit.wikimedia.org/r/366819 (owner: 10Muehlenhoff) [13:02:47] zeljkof: thanks for the help! [13:02:52] 10Operations, 10Analytics, 10DBA, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3729360 (10Marostegui) >>! 
In T156844#3729337, @elukey wrote: > All right we are ready to outline the next steps with some deadlines (tentative): > > * Novem... [13:03:33] kart_, gehel: does your patch take a long time to test? in that case it would be deployed after other patches [13:03:47] not much to test... [13:04:08] zeljkof: mine will take some time.. [13:04:21] (03CR) 10Dzahn: "already applied on bromine via profile::microsites::static_bugzilla , just needs rebase" [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:04:23] kart_: ok, in that case, deploying gehel's patch first [13:04:35] gehel: will ping you when it's at mwdebug1002 [13:04:38] in a few minutes [13:04:51] zeljkof: ack [13:05:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [13:06:50] (03PS2) 10Dzahn: profile::microsites::annualreport: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:07:38] (03PS3) 10Dzahn: profile::microsites::annualreport: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:07:40] (03PS1) 10ArielGlenn: generate separate config for dumps jobs on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/388048 (https://phabricator.wikimedia.org/T178893) [13:07:49] (03Merged) 10jenkins-bot: use the logstash LVS endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [13:08:00] (03CR) 10jenkins-bot: use the logstash LVS endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383355 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [13:09:18] gehel: the patch is at mwdebug1002, please test and let me know if I can deploy [13:09:27] * gehel is testing [13:09:55] log are still going through, looks good [13:10:14] gehel: ok to deploy? 
[13:10:22] zeljkof: yes, ok to deploy [13:10:29] ok, deploying [13:11:24] !log zfilipin@tin Synchronized wmf-config/ProductionServices.php: SWAT: [[gerrit:383355|use the logstash LVS endpoint (T175242)]] (duration: 00m 51s) [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242 [13:11:47] gehel: deployed, please monitor logs for a while and thanks for deploying with #releng ;) [13:12:02] zeljkof: Thanks! I'll keep an eye on things... [13:12:47] kart_: merging your patch, waiting for CI, will ping you when it's at mwdebug1002 [13:14:35] zeljkof: sure [13:15:02] (03CR) 10Dzahn: [C: 032] profile::microsites::annualreport: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:15:07] (03PS4) 10Dzahn: profile::microsites::annualreport: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:16:04] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:16:52] (03PS2) 10ArielGlenn: generate separate config for dumps jobs on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/388048 (https://phabricator.wikimedia.org/T178893) [13:18:11] (03CR) 10ArielGlenn: [C: 032] generate separate config for dumps jobs on dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/388048 (https://phabricator.wikimedia.org/T178893) (owner: 10ArielGlenn) [13:18:23] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3729423 (10Aklapper) 05Open>03stalled Setting task status to `stalled` as this is blocked on fixing {T170817}. 
[13:21:47] kart_: the patch is at mwdebug1002, let me know if you need more than 5 minutes to test [13:22:46] zeljkof: OK. Let me try. [13:28:57] zeljkof: me and Santhosh are testing; [13:29:28] kart_: ok, no rush, just let me know the ETA when/if you have one [13:30:08] zeljkof: few more minutes. Bit tricky, but there is a way. [13:33:21] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review: wikimedia-jessie & wikimedia-stretch docker images don't have deb-src set for apt.wikimedia.org - https://phabricator.wikimedia.org/T179354#3721999 (10akosiaris) `puppet/modules/package_builder` and `puppet/modules/docker`... [13:33:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "See comment https://phabricator.wikimedia.org/T179354#3729501, I don't think this will solve the problem" [puppet] - 10https://gerrit.wikimedia.org/r/387984 (https://phabricator.wikimedia.org/T179354) (owner: 10Legoktm) [13:34:24] zeljkof: all OK. [13:34:35] kart_: ok, deploying [13:34:38] zeljkof: go ahead please. [13:35:43] !log zfilipin@tin Synchronized php-1.31.0-wmf.6/extensions/ContentTranslation/modules/publish/ext.cx.publish.js: SWAT: [[gerrit:387996|CX1: Check for template adaptation failures before publishing (T154116)]] (duration: 00m 50s) [13:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] T154116: Content Translation does not publish the translation - https://phabricator.wikimedia.org/T154116 [13:36:21] kart_: deployed, please monitor logs for a while and tanks for deploying with #releng ;) [13:36:44] zeljkof: yeah. tanks and guns :) [13:36:53] !log EU SWAT finished [13:36:57] zeljkof: thanks! see you next time! [13:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:21] heh, I was going to add stuff to swat, but didnt think it started for 20 mins! Guess I will use a slot this evening! 
:d [13:39:19] addshore: there is still time, and you can deploy yourself (or I can deploy) [13:39:49] the window is still ours, I release it when I am out of things to deploy [13:41:04] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:41:07] (03PS5) 10Dzahn: profile::microsites::annualreport: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:42:04] zeljkof: I need to check through the patches before deploying, so will wait :) [13:44:07] (03CR) 10Dzahn: "https://annual.wikimedia.org is just fine (bromine)" [puppet] - 10https://gerrit.wikimedia.org/r/366815 (owner: 10Muehlenhoff) [13:44:33] (03CR) 10Ottomata: [C: 031] "As far I as can understand, +1." [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [13:45:03] (03PS2) 10Dzahn: profile::microsites::transparency: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366820 (owner: 10Muehlenhoff) [13:46:14] (03PS1) 10Gehel: cassandra: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) [13:46:32] !log upgrade dnsmasq-base on labcontrol* [13:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:14] (03CR) 10Dzahn: [C: 032] "just like bugzilla_static and annual, all of these are on bromine" [puppet] - 10https://gerrit.wikimedia.org/r/366820 (owner: 10Muehlenhoff) [13:48:38] (03CR) 10Dzahn: "https://transparency.wikimedia.org is ok" [puppet] - 10https://gerrit.wikimedia.org/r/366820 (owner: 10Muehlenhoff) [13:51:01] (03CR) 10Eevans: [C: 031] cassandra: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [13:58:11] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - 
https://phabricator.wikimedia.org/T177196#3729595 (10fgiunchedi) [13:59:16] (03CR) 10Dzahn: [C: 032] Gerrit: Add some soy templates for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/387410 (owner: 10Paladox) [13:59:20] (03PS3) 10Dzahn: Gerrit: Add some soy templates for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/387410 (owner: 10Paladox) [13:59:40] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10fgiunchedi) [14:00:37] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8579/" [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:00:39] (03PS4) 10Rush: openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [14:01:16] (03CR) 10Elukey: "> As far I as can understand, +1." [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:01:21] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:02:42] (03PS3) 10Volans: Cumin: fix wmf-style violations [puppet] - 10https://gerrit.wikimedia.org/r/386799 [14:02:54] (03CR) 10Volans: "latest compiler for reference: https://puppet-compiler.wmflabs.org/compiler02/8607/" [puppet] - 10https://gerrit.wikimedia.org/r/386799 (owner: 10Volans) [14:03:22] (03CR) 10Volans: [C: 032] Cumin: fix wmf-style violations [puppet] - 10https://gerrit.wikimedia.org/r/386799 (owner: 10Volans) [14:04:36] (03PS4) 10Dzahn: Gerrit: Add some soy templates for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/387410 (owner: 10Paladox) [14:05:45] (03PS1) 10Filippo Giunchedi: hieradata: rollout smart health check to codfw [puppet] - 
10https://gerrit.wikimedia.org/r/388056 (https://phabricator.wikimedia.org/T86552) [14:05:47] (03PS1) 10Filippo Giunchedi: smart: add ensure metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) [14:08:20] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8608/aqs1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:12:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] gerrit: let Apache proxy only listen on service IP (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [14:12:50] (03PS2) 10Filippo Giunchedi: hieradata: rollout smart health check to codfw [puppet] - 10https://gerrit.wikimedia.org/r/388056 (https://phabricator.wikimedia.org/T86552) [14:12:53] (03PS2) 10Filippo Giunchedi: smart: add ensure metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) [14:17:40] (03CR) 10Gehel: "This affects the maps, aqs, restbase cassandra clusters (https://puppet-compiler.wmflabs.org/compiler03/8612/)" [puppet] - 10https://gerrit.wikimedia.org/r/388052 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:18:37] (03PS3) 10Filippo Giunchedi: hieradata: rollout smart health check to codfw [puppet] - 10https://gerrit.wikimedia.org/r/388056 (https://phabricator.wikimedia.org/T86552) [14:18:39] (03PS3) 10Filippo Giunchedi: smart: add ensure metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) [14:20:52] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8613/" [puppet] - 10https://gerrit.wikimedia.org/r/388056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:23:35] (03PS4) 10Filippo Giunchedi: hieradata: rollout smart health check to codfw [puppet] - 10https://gerrit.wikimedia.org/r/388056 
(https://phabricator.wikimedia.org/T86552) [14:25:18] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout smart health check to codfw [puppet] - 10https://gerrit.wikimedia.org/r/388056 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:27:04] (03PS1) 10Ottomata: Import mediawiki.recentchange data into HDFS with Camus [puppet] - 10https://gerrit.wikimedia.org/r/388063 [14:28:52] (03CR) 10Ottomata: [C: 032] Import mediawiki.recentchange data into HDFS with Camus [puppet] - 10https://gerrit.wikimedia.org/r/388063 (owner: 10Ottomata) [14:29:38] (03CR) 10Dzahn: [C: 04-1] "this moved to a different place. it's now modules/profile/manifests/releases/mediawiki.pp" [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:30:13] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:30:26] (03PS2) 10Dzahn: profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:31:00] (03CR) 10Ottomata: [C: 031] "Oh, without the cache hostname path? I wonder if researchers would care about this." [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:31:03] (03CR) 10jerkins-bot: [V: 04-1] profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:41:00] (03CR) 10Elukey: "> Oh, without the cache hostname path? 
I wonder if researchers would" [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:43:07] (03PS1) 10Ema: VCL: log TLS information to VSM [puppet] - 10https://gerrit.wikimedia.org/r/388064 (https://phabricator.wikimedia.org/T177199) [14:43:43] (03CR) 10Ema: "> So x_cache will be left untouched, but cache_status will be updated" [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:46:02] (03CR) 10Ottomata: [C: 031] "Ah, ok great! Understand now, this is A-ok." [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:46:21] (03PS3) 10Dzahn: profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:46:24] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:47:11] (03CR) 10jerkins-bot: [V: 04-1] profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:48:06] (03PS4) 10Dzahn: profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:49:36] (03CR) 10Dzahn: [C: 032] profile::microsites::releases: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:50:19] (03PS5) 10Dzahn: releases::mediawiki: limit ferm srange to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:50:36] (03CR) 10Dzahn: [V: 032 C: 032] releases::mediawiki: limit ferm srange to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:52:56] (03Abandoned) 10Muehlenhoff: profile::docker::registry: Restrict to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366859 (owner: 10Muehlenhoff) [14:53:01] (03PS4) 
10Filippo Giunchedi: smart: add ensure metaparameter [puppet] - 10https://gerrit.wikimedia.org/r/388057 (https://phabricator.wikimedia.org/T86552) [14:53:03] (03PS1) 10Filippo Giunchedi: labstore: use require_package [puppet] - 10https://gerrit.wikimedia.org/r/388067 (https://phabricator.wikimedia.org/T86552) [14:53:09] (03CR) 10Dzahn: "applied on releases* https://releases.wikimedia.org/mediawiki/ is OK" [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [14:53:24] (03PS4) 10Muehlenhoff: Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) [14:54:08] (03CR) 10Muehlenhoff: [C: 031] labstore: use require_package [puppet] - 10https://gerrit.wikimedia.org/r/388067 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [14:57:36] (03CR) 10Dzahn: "checked that dput/debupload from tin isn't affected since it uses scp" [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff) [15:00:19] (03PS1) 10Ottomata: Import resource_change topic into HDFS via Camus [puppet] - 10https://gerrit.wikimedia.org/r/388068 [15:02:37] (03CR) 10Ottomata: [C: 032] Import resource_change topic into HDFS via Camus [puppet] - 10https://gerrit.wikimedia.org/r/388068 (owner: 10Ottomata) [15:11:59] (03PS19) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [15:13:08] (03CR) 10Dzahn: [C: 031] gerrit: let Apache proxy only listen on service IP (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [15:14:02] (03CR) 10Alexandros Kosiaris: [C: 031] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [15:15:09] (03PS20) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [15:19:08] hashar, paladox, `sudo gem install rake` solved my issue, 
thanks [15:20:40] (03PS4) 10Volans: Backends: add support to external backends plugins [software/cumin] - 10https://gerrit.wikimedia.org/r/384616 (https://phabricator.wikimedia.org/T178342) [15:20:42] (03PS2) 10Volans: Logging: uniform loggers [software/cumin] - 10https://gerrit.wikimedia.org/r/386399 (https://phabricator.wikimedia.org/T179002) [15:20:44] (03PS2) 10Volans: Logging: use % syntax for parameters [software/cumin] - 10https://gerrit.wikimedia.org/r/386400 (https://phabricator.wikimedia.org/T179002) [15:20:49] hashar, paladox, but now I have another issue... https://www.irccloud.com/pastebin/vbu3qKWH/ [15:21:56] (03CR) 10Elukey: "Let's also update https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest once this is done" [puppet] - 10https://gerrit.wikimedia.org/r/387817 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [15:22:19] (03PS1) 10Muehlenhoff: * New upstream release - CVE-2017-3735 - CVE-2017-3736 [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388073 [15:24:05] (03PS1) 10Ottomata: Import mediawiki.job queue topics into HDFS via Camus [puppet] - 10https://gerrit.wikimedia.org/r/388074 [15:24:47] 10Operations, 10Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3729831 (10kaythaney) Thanks, Dzahn. We're all set. 
[15:25:01] (03CR) 10Muehlenhoff: [C: 032] * New upstream release - CVE-2017-3735 - CVE-2017-3736 [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388073 (owner: 10Muehlenhoff) [15:27:36] (03PS1) 10Alexandros Kosiaris: Create module for docker-pkg software [puppet] - 10https://gerrit.wikimedia.org/r/388075 [15:31:13] (03PS2) 10Alexandros Kosiaris: Create module for docker-pkg software [puppet] - 10https://gerrit.wikimedia.org/r/388075 [15:33:47] (03CR) 10Alexandros Kosiaris: [C: 031] "Boron seems happy at https://puppet-compiler.wmflabs.org/compiler02/8616/boron.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/388075 (owner: 10Alexandros Kosiaris) [15:36:36] (03PS1) 10Muehlenhoff: Update no-symbolic.patch to 1.1.0g [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388076 [15:43:48] (03PS7) 10Pmiazga: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [15:47:51] (03PS2) 10Filippo Giunchedi: labstore: use require_package [puppet] - 10https://gerrit.wikimedia.org/r/388067 (https://phabricator.wikimedia.org/T86552) [15:48:52] (03CR) 10Filippo Giunchedi: [C: 032] labstore: use require_package [puppet] - 10https://gerrit.wikimedia.org/r/388067 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:51:33] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:14] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:58:31] (03PS8) 10Elukey: Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [15:58:45] (03CR) 10Mobrovac: [C: 032] Run the updateBetaFeatureUserCount on kafka queue only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 (https://phabricator.wikimedia.org/T175210) 
(owner: 10Ppchelko) [15:59:23] (03CR) 10Elukey: [C: 032] Implement Schema:Print purging strategy [puppet] - 10https://gerrit.wikimedia.org/r/379829 (https://phabricator.wikimedia.org/T175395) (owner: 10Bmansurov) [16:00:05] godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:32] (03Merged) 10jenkins-bot: Run the updateBetaFeatureUserCount on kafka queue only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 (https://phabricator.wikimedia.org/T175210) (owner: 10Ppchelko) [16:00:41] (03CR) 10jenkins-bot: Run the updateBetaFeatureUserCount on kafka queue only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388037 (https://phabricator.wikimedia.org/T175210) (owner: 10Ppchelko) [16:02:07] gooooood! [16:03:04] (03PS4) 10Gehel: logstash: update logstash_syslog common hiera parameter to point to LVS. [puppet] - 10https://gerrit.wikimedia.org/r/383146 (https://phabricator.wikimedia.org/T175242) [16:03:14] (03CR) 10Muehlenhoff: [C: 032] Update no-symbolic.patch to 1.1.0g [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388076 (owner: 10Muehlenhoff) [16:04:16] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Use only EventBus for processing updateBetaFeatureUserCount - T175210 (duration: 00m 51s) [16:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:23] T175210: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210 [16:04:29] (03PS5) 10Gehel: logstash: update logstash_syslog common hiera parameter to point to LVS. [puppet] - 10https://gerrit.wikimedia.org/r/383146 (https://phabricator.wikimedia.org/T175242) [16:06:21] (03CR) 10Gehel: [C: 032] logstash: update logstash_syslog common hiera parameter to point to LVS. 
[puppet] - 10https://gerrit.wikimedia.org/r/383146 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [16:11:47] (03PS1) 10Ppchelko: [Logging] Enable logstash logging for EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) [16:12:27] (03CR) 10Ottomata: [C: 031] [Logging] Enable logstash logging for EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) (owner: 10Ppchelko) [16:13:36] (03PS1) 10Muehlenhoff: Update symbols for 1.1.0g [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388080 [16:14:23] !log restart varnish-be on cp4026 due to mbox lag [16:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:27] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (10fgiunchedi) >>! In T179050#3726279, @MoritzMuehlenhoff wrote: >>>! In T179050#3726257, @BBlack wrote: >> +1 We may as well move to stretch here. For the bastion/installserver role it should be... [16:21:23] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [16:21:47] 10Operations, 10Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3730005 (10fgiunchedi) +1 for ms-fe, I'm assuming the rollout will happen with puppet disabled and the progressively re-enabled ? [16:25:33] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:08] (03CR) 10Muehlenhoff: [C: 032] Update symbols for 1.1.0g [debs/openssl11] - 10https://gerrit.wikimedia.org/r/388080 (owner: 10Muehlenhoff) [16:27:48] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3730013 (10Legoktm) >>! 
In T177891#3728971, @Addshore wrote: > But wouldn't a new edit not have any issue with caching? I had a lot of... [16:27:59] 10Operations, 10media-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3730015 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I agree let's resolve this, I'll followu... [16:28:40] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3730021 (10RobH) So this now reads 'bast4001' in the subject, but it is bast4002, just making sure no one changed that intentionally? (I setup the task as bast4002, so checking.) [16:33:49] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3711643 (10Dzahn) +1 to 400**2** and stretch ! [16:35:14] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), 10User-fgiunchedi: Find and clear oversized x-content-dimensions headers - https://phabricator.wikimedia.org/T179595#3730033 (10fgiunchedi) [16:45:38] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3730085 (10Ladsgroup) [16:45:55] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3730087 (10RobH) I'm not exactly sure what you mean by unable to access ports, so I'll just list off the issue I'm seeing and what I've confirmed between the console servers. Compare the setup of scs-... 
[16:46:02] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3730088 (10RobH) a:05RobH>03Cmjohnson [16:55:33] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:58:40] (03PS2) 10Ottomata: Import mediawiki.job queue topics into HDFS via Camus [puppet] - 10https://gerrit.wikimedia.org/r/388074 [16:58:42] (03PS1) 10Ottomata: Don't refine mediawiki_recentchange events [puppet] - 10https://gerrit.wikimedia.org/r/388090 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:05] (03PS1) 10Ladsgroup: Enable draftquality model in ORES extension for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388092 (https://phabricator.wikimedia.org/T179596) [17:00:20] (03CR) 10Mobrovac: [C: 04-1] "LGTM, but cannot go before I4f41ad56aa9d6e625daa3709e7679055622a55b7 is deployed everywhere, so -1'ing until that happens." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388079 (https://phabricator.wikimedia.org/T150106) (owner: 10Ppchelko) [17:00:37] awight, what do you think? Are we doing ORES or still experimenting in beta? [17:00:43] Let’s roll it out! [17:00:48] \o/ [17:00:57] no parsoid deploy today [17:01:11] so: ORES going cray cray today. 
[17:01:18] (03CR) 10Ottomata: [C: 032] Import mediawiki.job queue topics into HDFS via Camus [puppet] - 10https://gerrit.wikimedia.org/r/388074 (owner: 10Ottomata) [17:01:20] (03CR) 10Ottomata: [C: 032] Don't refine mediawiki_recentchange events [puppet] - 10https://gerrit.wikimedia.org/r/388090 (owner: 10Ottomata) [17:01:42] * Amir1 wears SWAT shield [17:01:48] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3730136 (10demon) What about eventlog* and sca*? [17:07:10] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3730151 (10greg) Of course. (oops) [17:07:20] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4001/WMF7218 - https://phabricator.wikimedia.org/T179050#3730152 (10RobH) Ok, I'll reimage, I'm also doing the conversion to bastion profile. (Unless Brandon tells me otherwise, I'm also making all references on this new server be bast4002, since bast4001 will... [17:07:28] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#3730153 (10RobH) [17:08:42] !log awight@tin Started deploy [ores/deploy@0e54a3c]: revscoring 2, T175180 [17:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:48] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [17:09:01] halfak: Amir1: heads-up! ^ [17:09:12] \o/ [17:10:43] halfak: Double-checking, what can I test after the canary stage? Just hit the API 20x and see if one response is a 5xx? 
[17:12:43] (03PS4) 10Giuseppe Lavagetto: profile::ci::shipyard: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) [17:13:28] (03PS1) 10Bearloga: profile::discovery_dashboards: Remove forecasts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) [17:15:45] No errors that I can see, carrying on... [17:16:16] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8618/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [17:21:37] 10Operations, 10Ops-Access-Requests: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3720398 (10RobH) @hoo: I've reviewed the L3 document, and I don't see your signature on it. This is likely due to the fact you've had server access since before phabricator. However, we like to have users s... [17:22:09] (03CR) 10Thcipriani: [C: 031] profile::ci::shipyard: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [17:23:07] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::ci::shipyard: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/388031 (https://phabricator.wikimedia.org/T177276) (owner: 10Giuseppe Lavagetto) [17:24:16] (03CR) 10Chad: "How about we leave both of them in place for now so we can land this preemptively and not have to coordinate with the upgrade time?" 
[puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) (owner: 10Paladox) [17:24:40] (03CR) 10Paladox: "> How about we leave both of them in place for now so we can land" [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) (owner: 10Paladox) [17:25:14] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3730226 (10EBernhardson) It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs even if the back... [17:25:54] Krinkle: o/ - do you have a min for a chat about refresh links? [17:26:55] (03PS8) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [17:27:00] (03PS9) 10Paladox: Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) [17:30:37] (03PS1) 10Alexandros Kosiaris: kubernetes: Enable RBAC in production [puppet] - 10https://gerrit.wikimedia.org/r/388122 (https://phabricator.wikimedia.org/T177393) [17:31:24] (03PS1) 10Giuseppe Lavagetto: docker::builder: explicitly install make [puppet] - 10https://gerrit.wikimedia.org/r/388124 [17:32:10] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap: Standardize git version - https://phabricator.wikimedia.org/T179353#3721967 (10mobrovac) >>! In T179353#3730136, @demon wrote: > sca*? Those hosts have only Zotero on them, and they will never move off of trusty (because reasons).... 
[17:32:19] (03PS2) 10Giuseppe Lavagetto: docker::builder: explicitly install make [puppet] - 10https://gerrit.wikimedia.org/r/388124 [17:32:52] (03CR) 10Chad: [C: 031] Gerrit: Replace certificates with tokens for its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/384901 (https://phabricator.wikimedia.org/T178385) (owner: 10Paladox) [17:33:33] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::builder: explicitly install make [puppet] - 10https://gerrit.wikimedia.org/r/388124 (owner: 10Giuseppe Lavagetto) [17:33:57] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3730267 (10Cmjohnson) Tested a standard ethernet cable and it works fine. It appears that the custom pinout for the cable is no longer required and each of the cables will need to be re-done. [17:37:38] (03PS1) 10Jdlrobson: [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) [17:39:31] (03CR) 10Chad: "Good point. This is an *old* change from before the resource existed, so in rebasing/moving forward I didn't put it there. Will amend." 
[puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [17:39:49] (03PS2) 10Jdlrobson: [labs] Unconditionally enable popups for anonymous users on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388127 (https://phabricator.wikimedia.org/T179546) [17:42:31] !log awight@tin Finished deploy [ores/deploy@0e54a3c]: revscoring 2, T175180 (duration: 33m 50s) [17:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:38] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [17:47:41] !log uploaded openssl 1.1.0g-1+wmf1 to jessie-wikimedia [17:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:01] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3730359 (10elukey) >>! In T173710#3730226, @EBernhardson wrote: > It was perhaps noted before, but because of the recursive nature of the refreshLinks... 
[17:50:54] PROBLEM - Host labvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:23] (03PS1) 10Gehel: elasticsearch: auto reload log4j2 configuration [puppet] - 10https://gerrit.wikimedia.org/r/388130 [17:53:49] !log demon@tin Synchronized php-1.31.0-wmf.6/includes/libs/rdbms: logging improvements (duration: 00m 54s) [17:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:03] i guess labvirt1015 being down is planned ^^ [17:55:07] !log demon@tin Synchronized php-1.31.0-wmf.6/includes/Message.php: logging fixes (duration: 00m 48s) [17:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:03] RECOVERY - Host labvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [17:59:08] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: connect second interface for each frack to opposite switch for each eqiad host - https://phabricator.wikimedia.org/T176975#3643120 (10Cmjohnson) the 2nd interfaces are connected, updated the switch descriptions, I did not enable the ports. [17:59:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3730397 (10Cmjohnson) The CPU was replaced and idrac log cleared. [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T1800). [18:00:04] Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:48] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:12] wut... [18:01:21] gehel: around? 
[18:01:23] <_joe_> uh what's up? [18:01:32] * apergos peeks in [18:01:37] yep [18:01:39] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.045 second response time [18:01:40] looking [18:01:57] ok, I looked to hard and made it behave... [18:02:34] it looks like our new LVS check that actually checks blazegraph is working! [18:04:03] _joe_: I was trying to follow-up the scap deploy --init thing, but Ruby isn't my forte. Does this look even remotely sane? https://phabricator.wikimedia.org/P6249 [18:04:35] <_joe_> no_justification: oh thanks, I didn't expect you to, I had made a note to myself to work on it [18:04:40] <_joe_> I'll look in a few [18:04:53] okie dokie :) [18:08:26] (03PS1) 10Lucas Werkmeister (WMDE): Add rewrite rules for normalized Wikidata predicates [puppet] - 10https://gerrit.wikimedia.org/r/388134 [18:09:06] so, wdqs seems to be slowing down on the eqiad cluster, not sure why, I'm looking [18:09:25] gehel: give it more cookies, that should help [18:09:40] (03CR) 10Gehel: profile::discovery_dashboards: Remove forecasts dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) (owner: 10Bearloga) [18:13:36] !log restarting stuck updater on wdqs1003 [18:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:36] (03PS3) 10Krinkle: navtiming: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 (https://phabricator.wikimedia.org/T179555) (owner: 10Phedenskog) [18:14:41] (03PS1) 10Chad: group2 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388135 [18:14:43] (03CR) 10Chad: [C: 04-2] group2 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388135 (owner: 10Chad) [18:14:46] (03CR) 10Krinkle: [C: 031] navtiming: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 
(https://phabricator.wikimedia.org/T179555) (owner: 10Phedenskog) [18:16:33] Eh, the channel topic is outdated, right? [18:16:38] no_justification: "stuck at wmf.4" [18:18:03] (03PS4) 10Giuseppe Lavagetto: navtiming: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 (https://phabricator.wikimedia.org/T179555) (owner: 10Phedenskog) [18:19:07] (03CR) 10Giuseppe Lavagetto: [C: 032] navtiming: Removed old secureConnectionStart check that throw away metrics [puppet] - 10https://gerrit.wikimedia.org/r/387997 (https://phabricator.wikimedia.org/T179555) (owner: 10Phedenskog) [18:19:17] (03CR) 10Chelsyx: [C: 031] profile::discovery_dashboards: Remove forecasts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) (owner: 10Bearloga) [18:19:42] * elukey afk! [18:20:06] Krinkle: right! [18:20:43] !log restarting blazegraph on wdqs1003 [18:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] restarting blazegraph seems to help... now the question is why... [18:24:53] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:24:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:26:11] ^ that might be wdqs, it should reslove in a few seconds [18:30:00] <_joe_> gehel: maybe depooling could help [18:30:38] _joe_: it is back on track... restarting blazegraph on wdqs1003 fixed it [18:31:01] now, I have no idea why and what went wrong... [18:33:35] 5xx are down on wdqs, but the peak of 5xx https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 does not match wdqs, there was something else as well... 
[18:34:53] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:34:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:36:40] gehel that graph can be confusing for a number of raisins [18:36:54] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5&from=1509639013611&to=1509647732303 [18:36:58] ^ ? [18:37:28] there's some minor bits of 500, then a spike of 502 (which is 5xx, but neither 500 nor 503, which part of the earlier graph looks at those two) [18:37:50] elukey: Yep [18:37:53] (refresh links) [18:37:59] (03PS5) 10Ayounsi: [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [18:38:01] bblack: Ok, that seems to match better with what I see on wdqs [18:38:21] so no immediate problem, but I need to dig into logs... [18:38:32] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [18:38:40] (03CR) 10Ayounsi: [V: 032 C: 032] Add fake keys for Netbox deployment [labs/private] - 10https://gerrit.wikimedia.org/r/387878 (owner: 10Ayounsi) [18:43:32] (03PS1) 10Mobrovac: JobQueue: Use EventBus for all "hearted" jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388139 (https://phabricator.wikimedia.org/T175210) [18:46:08] (03PS1) 10ArielGlenn: run xml/sql dumps on dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/388142 (https://phabricator.wikimedia.org/T178893) [19:00:04] no_justification: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS.
[19:06:43] (03CR) 10Chad: [C: 032] group2 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388135 (owner: 10Chad) [19:08:30] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3730584 (10Dispenser) @Keegan You've got less than a fortnight! [19:09:01] (03Merged) 10jenkins-bot: group2 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388135 (owner: 10Chad) [19:09:07] (03CR) 10Bearloga: profile::discovery_dashboards: Remove forecasts dashboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) (owner: 10Bearloga) [19:09:10] (03CR) 10jenkins-bot: group2 to wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388135 (owner: 10Chad) [19:20:19] (03PS2) 10Gehel: profile::discovery_dashboards: Remove forecasts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) (owner: 10Bearloga) [19:21:09] (03CR) 10Gehel: [C: 032] profile::discovery_dashboards: Remove forecasts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/388118 (https://phabricator.wikimedia.org/T112170) (owner: 10Bearloga) [19:25:03] 10Operations, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#3730632 (10mpopov) [19:28:42] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.6 [19:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:02] (03PS2) 10ArielGlenn: run xml/sql dumps on dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/388142 (https://phabricator.wikimedia.org/T178893) [19:30:12] (03CR) 10ArielGlenn: [C: 032] run xml/sql dumps on dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/388142 (https://phabricator.wikimedia.org/T178893) (owner: 
10ArielGlenn) [19:41:04] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3730671 (10Jgreen) Note ubuntu/trusty config is fairly different, here's a writeup that worked, I just changed bond-mode to active-backup: https://paulmellor... [20:11:29] 10Operations, 10Ops-Access-Requests: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3730724 (10Sharvaniharan) [20:18:12] !log simulate load for labvirt1015 'sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &"' [20:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3730770 (10chasemp) !log simulate load for labvirt1015 'sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &"' [20:19:35] (03PS1) 10EBernhardson: Force constant MLR model on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 [20:19:41] 10Operations, 10Ops-Access-Requests: Requesting Sharvani Haran to be added to researchers group - https://phabricator.wikimedia.org/T179611#3730787 (10Fjalapeno) In case it is needed, this request has my approval [20:20:04] (03CR) 10EBernhardson: [C: 032] "repair search on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 (owner: 10EBernhardson) [20:20:20] (03PS2) 10EBernhardson: Force constant MLR model on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 [20:20:25] (03CR) 10EBernhardson: [C: 032] Force constant MLR model on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 (owner: 10EBernhardson) [20:20:36] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-EventLogging: Requesting Sharvani Haran to be added to researchers
group - https://phabricator.wikimedia.org/T179611#3730792 (10Fjalapeno) [20:20:45] 20:20:10 Exception: ('command: ', "echo 'aawiki'; /usr/local/bin/mwscript update.php --wiki=aawiki --quick", 'output: ', "aawiki\n#!/usr/bin/env php\nPHP Parse error: syntax error, unexpected ''enwiki'' (T_CONSTANT_ENCAPSED_STRING), expecting ']' in /srv/mediawiki-staging/wmf-config/InitialiseSettings-labs.php on line 697\nParse error: syntax error, unexpected ''enwiki'' (T_CONSTANT_ENCAPSED_STRING), expecting ']' in /srv/ [20:20:45] mediawiki-staging/wmf-config/InitialiseSettings-labs.php on line 697\n") [20:27:31] (03Merged) 10jenkins-bot: Force constant MLR model on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 (owner: 10EBernhardson) [20:27:40] (03CR) 10jenkins-bot: Force constant MLR model on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/388165 (owner: 10EBernhardson) [20:30:23] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:33:01] (03PS1) 10Ayounsi: Fix two Jenkins tests errors in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/388167 [20:34:30] (03CR) 10Hashar: [C: 031] Fix two Jenkins tests errors in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/388167 (owner: 10Ayounsi) [20:34:51] (03CR) 10Ayounsi: [C: 032] Fix two Jenkins tests errors in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/388167 (owner: 10Ayounsi) [20:36:36] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [20:37:34] RECOVERY - Check Varnish expiry mailbox lag on cp4023 is OK: OK: expiry mailbox lag is 0 [20:41:24] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 0 [20:52:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3730869 (10chasemp) I started our 20 test instances here and issued the same command to generate load and it died pretty much immediately. ```labcontrol1001:~# nova list --a... 
[20:54:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3730871 (10chasemp) {F10576661} Getting turned back on and then dying from load [20:55:23] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:27] !log Powercycling labvirt1015 [20:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:22] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds [21:38:25] no_justification i think we broke gerrit's nagios check [21:38:26] with your heap adjustment (easy fix though) [21:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:28] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [21:40:04] sjoerddebruin: ^ [21:41:15] :D [21:41:18] (03Draft1) 10Paladox: Gerrit: Fix nagios check [puppet] - 10https://gerrit.wikimedia.org/r/388189 [21:41:20] (03PS2) 10Paladox: Gerrit: Fix nagios check [puppet] - 10https://gerrit.wikimedia.org/r/388189 [21:41:25] no_justification ^^ :) [21:43:23] Confirmed that fixes it for me :) [21:43:38] was undetected as gerrit needed to be restarted for it to use the new command. [21:50:30] 10Operations, 10Ops-Access-Requests: Add hoo to perf-roots - https://phabricator.wikimedia.org/T179317#3731047 (10hoo) >>! In T179317#3730221, @RobH wrote: > @hoo: I've reviewed the L3 document, and I don't see your signature on it. This is likely due to the fact you've had server access since before phabrica...
[21:50:58] !log upgrading librenms [21:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:57] !log ayounsi@tin Started deploy [librenms/librenms@ad48b4d]: (no justification provided) [21:53:02] !log ayounsi@tin Finished deploy [librenms/librenms@ad48b4d]: (no justification provided) (duration: 00m 05s) [21:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:50] What's up with gerrit? [22:01:14] Reedy not sure what you mean? [22:01:25] It's on a serious go slow [22:01:55] Reedy it's slow? [22:02:00] That's what I just said [22:02:01] !log Mass-resizing Graphite/Whisper files on graphite1001 and graphite2001 for T179622 (frontend.* namespace) [22:02:06] it's fast for me [22:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:07] T179622: Update all performance team Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [22:07:18] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Find and clear oversized x-content-dimensions headers - https://phabricator.wikimedia.org/T179595#3731106 (10Krinkle) [22:19:50] Hey folks. We've got a minor problem in ORES prod that I'm working on a patch for. It's a quick configuration change, so I don't expect anything crazy to happen [22:21:49] https://gerrit.wikimedia.org/r/#/c/388252/ [22:21:50] BTW [22:33:46] Looks good in beta. [22:34:57] greg-g, looking to do a quick deploy of ORES to get a small config bug. This is out of window. Anything you'd like me to do other than the usual documentation? 
[22:35:52] (03CR) 10Ayounsi: [C: 032] LibreNMS: update AS#, add eqsin [puppet] - 10https://gerrit.wikimedia.org/r/388186 (owner: 10Ayounsi) [22:38:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (10Volans) @chasemp FYI if you add the labs project to the cumin query is immediate (as compared to go over all projects) and OpenStack API already does a regex, so t... [22:39:57] !log halfak@tin Started deploy [ores/deploy@82a13ae]: T179621 [22:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:03] T179621: Issues with ORES model on plwiki - https://phabricator.wikimedia.org/T179621 [22:40:36] Just checked the criteria for scheduling this deploy and it looks like this minor config change falls out of it, so I'm being bold. [22:45:03] PROBLEM - ores on scb1002 is CRITICAL: HTTP CRITICAL - No data received from host [22:45:35] ^ canary in deploy [22:46:03] RECOVERY - ores on scb1002 is OK: HTTP OK: HTTP/1.0 200 OK - 3580 bytes in 0.019 second response time [22:46:20] Looks like everything is fine. Just caused a blip [22:48:15] !log cleaned daemon.log from puppet spam logs on puppetmaster2001 (15GB) - cc herron [22:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:34] BTW, scb1002 now handling traffic just fine. [22:51:38] (03CR) 10Paladox: [C: 031] "Works on labs at least." [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171102T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:17] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 2 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3731256 (10Smalyshev) [23:10:56] !log disabled puppet master debug logging on puppetmaster2001 via /usr/share/puppet/rack/puppet-master/config.ru [23:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:06] thanks volans! [23:11:48] herron: thanks for having a look, also syslog is big although not that much, all "expected"? [23:12:11] the problem is that if it finish the space like it had already the other day puppet-merge will start failing [23:12:55] yeah just super verbose logging from puppet hitting syslog afaik [23:15:46] Looks like everything is good with the ORES deployment [23:16:10] declaring victory [23:20:05] (03CR) 10Chad: "This is so ridiculously fragile. Can we not do some sort of regex? Or maybe we should move all the parameters to an array and expand them " [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [23:20:35] (03CR) 10Chad: "(not your fault it's fragile, it's always been like this)" [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [23:20:37] (03CR) 10Paladox: "> This is so ridiculously fragile. Can we not do some sort of regex?" [puppet] - 10https://gerrit.wikimedia.org/r/388189 (owner: 10Paladox) [23:24:07] no_justification our previous command for nagios was [23:24:08] nrpe_command => "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war'", [23:24:25] i wonder how we would use a regex there [23:25:55] ah what about "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^.*-jar .* /var/lib/gerrit2/review_site/bin/gerrit.war .*'", [23:25:57] ? [23:27:25] Basically I think something that checks for "is java running this war file" is all we need.
[23:27:42] ok [23:27:55] idk though [23:28:01] I'd say the SSH and HTTP checks are sufficient [23:28:08] But on the slave they aren't I guess [23:29:51] (03PS1) 10Volans: Documentation: refactor documentation [software/cumin] - 10https://gerrit.wikimedia.org/r/388261 [23:33:26] no_justification: taking a quick patch in SWAT - given nobody added anything? [23:33:28] https://gerrit.wikimedia.org/r/#/c/387915/ [23:34:48] I certainly can't stop you 😂 [23:35:05] Just checking in case you're doing something else mw-related in prod. [23:35:08] :) [23:35:09] no_justification aha found a regex :) [23:35:54] (03PS3) 10Paladox: Gerrit: Fix nagios check [puppet] - 10https://gerrit.wikimedia.org/r/388189 [23:36:05] !log krinkle@tin Synchronized php-1.31.0-wmf.6/extensions/NavigationTiming/modules/ext.navigationTiming.js: Fix zero values - T178479 (duration: 00m 47s) [23:36:06] /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^${java_home}/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site' [23:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:11] T178479: Investigate navtiming2 values - https://phabricator.wikimedia.org/T178479 [23:36:33] no_justification at least anything in front of -jar won't need updating now :). [23:36:38] Krinkle: no sir, I'm actually mobile [23:36:43] not sure if we want to do it where daemon is. [23:41:16] (03CR) 10Volans: "Documentation-only change" [software/cumin] - 10https://gerrit.wikimedia.org/r/388261 (owner: 10Volans) [23:53:09] 10Operations, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3731320 (10bd808) 05Open>03Resolved a:03Marostegui I announced the current precarious read-only state at https://lists.wikimedia.org/pipermail/cloud-announce/2017-November/000008.h...
[23:53:26] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3731324 (10madhuvishy) fyi @Cmjohnson We are not doing the labsdb1003 reboot on Tuesday Nov 7, due to T179464. [23:53:41] 10Operations, 10cloud-services-team (Kanban): labsdb1001 crashed - storage issue - https://phabricator.wikimedia.org/T179464#3731326 (10bd808)