[01:04:15] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1503277447 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4945613 keys, up 4 minutes 4 seconds - replication_delay is 1503277447
[01:05:16] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4942168 keys, up 5 minutes 8 seconds - replication_delay is 0
[01:14:06] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:42:36] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[02:29:56] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 08m 57s)
[02:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 21 02:36:58 UTC 2017 (duration 7m 3s)
[02:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.43 seconds
[03:45:31] Operations, Gerrit, ORES, Scap, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3536646 (awight) /me likes @demon's post. Awesome, let's stay in coordination about how we might be able to help with this effort. Seems t...
[03:49:55] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[03:58:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 246.39 seconds
[04:18:16] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:59:03] (PS1) Muehlenhoff: Remove privileged LDAP access for siddharth11 [puppet] - https://gerrit.wikimedia.org/r/372814
[07:03:08] (CR) Muehlenhoff: [C: 2] Remove privileged LDAP access for siddharth11 [puppet] - https://gerrit.wikimedia.org/r/372814 (owner: Muehlenhoff)
[07:03:51] (PS1) Marostegui: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/372815
[07:07:02] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/372815 (owner: Marostegui)
[07:08:26] (Merged) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/372815 (owner: Marostegui)
[07:08:36] (CR) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/372815 (owner: Marostegui)
[07:09:49] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 (duration: 00m 45s)
[07:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:18] !log Stop s4 replication thread on db1095 - T172996
[07:13:26] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:33] T172996: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996
[07:13:40] moritzm: ^ is that you?
[07:22:19] damn, forgot to press y...
[07:22:25] now merged
[07:22:36] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[07:34:10] (PS2) Marostegui: sanitarium3: Prepare db1102 to run s4 instance [puppet] - https://gerrit.wikimedia.org/r/372518 (https://phabricator.wikimedia.org/T172996)
[07:34:56] (PS3) Marostegui: sanitarium3: Prepare db1102 to run s4 instance [puppet] - https://gerrit.wikimedia.org/r/372518 (https://phabricator.wikimedia.org/T172996)
[07:37:43] (CR) Marostegui: [C: 2] sanitarium3: Prepare db1102 to run s4 instance [puppet] - https://gerrit.wikimedia.org/r/372518 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[07:42:09] Regarding the recent email to Ops, should be worried? Can I help?
[07:42:33] !log Drop cx_drafts table from x1 - T172364
[07:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:45] T172364: Remove cx_drafts table from production - https://phabricator.wikimedia.org/T172364
[08:05:44] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3536747 (akosiaris) Open>stalled >>! In T171167#3534026, @ayounsi wrote: > Those changes should land in the august release of LibreNMS (ht...
[08:08:42] !log Rename tables article_assessment, article_assessment_pages, article_assessment_ratings tables from testwiki on db1078 - T173590
[08:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:19] (CR) Alexandros Kosiaris: [C: -1] "Technically looks fine to me, -1 just for a more descriptive commit message and we are good to go."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[08:22:39] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1015 [mediawiki-config] - https://gerrit.wikimedia.org/r/372816 (https://phabricator.wikimedia.org/T173570)
[08:23:32] (PS1) Marostegui: s3.hosts: Remove db1015 [software] - https://gerrit.wikimedia.org/r/372817 (https://phabricator.wikimedia.org/T173570)
[08:25:55] (PS3) Ladsgroup: ores: Add hieradata for number of celery workers [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[08:26:14] (PS1) Marostegui: mariadb: Remove db1015 [puppet] - https://gerrit.wikimedia.org/r/372818 (https://phabricator.wikimedia.org/T173570)
[08:27:43] (PS2) Marostegui: mariadb: Remove db1015 [puppet] - https://gerrit.wikimedia.org/r/372818 (https://phabricator.wikimedia.org/T173570)
[08:29:17] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/7548/" [puppet] - https://gerrit.wikimedia.org/r/372818 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[08:31:04] Operations, ops-eqiad, Wikimedia-Logstash: Failed disk on logstash1006 - https://phabricator.wikimedia.org/T173689#3536783 (Gehel)
[08:31:41] ACKNOWLEDGEMENT - Disk space on logstash1006 is CRITICAL: DISK CRITICAL - /var/lib/elasticsearch is not accessible: Input/output error Gehel failed disk - T173689
[08:31:41] ACKNOWLEDGEMENT - puppet last run on logstash1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/var/lib/elasticsearch] Gehel failed disk - T173689
[08:36:22] Operations, ops-eqiad, Wikimedia-Logstash, Discovery-Search (Current work): Failed disk on logstash1006 - https://phabricator.wikimedia.org/T173689#3536796 (Gehel) p:Triage>High
[08:36:51] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3536799 (MoritzMuehlenhoff) >>! In T171166#3533534, @Anomie wrote: > If there's a way to run the Scribunto "--group LuaSandbox" phpunit tests with the new version, that'd b...
[08:46:01] (PS4) Ladsgroup: ores: Add hieradata for number of celery workers [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[09:00:21] (PS5) Alexandros Kosiaris: ores: Add hieradata for number of celery workers [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[09:00:43] (CR) Alexandros Kosiaris: [V: 2 C: 2] ores: Add hieradata for number of celery workers [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[09:01:20] !log upgrading hhvm-luasandbox on mw1261
[09:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:59] Operations, ORES, Scoring-platform-team, Patch-For-Review, User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3536827 (akosiaris) Changed above merged, I did a puppet test ran on both production (a noop as expected) and a stresstest node (120 CELERYD_...
[09:05:34] (CR) Filippo Giunchedi: [C: 1] Add Prometheus lua script for nginx-extras [puppet/nginx] - https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[09:05:48] (CR) Filippo Giunchedi: [C: 2] Add Prometheus lua script for nginx-extras [puppet/nginx] - https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[09:06:08] (Merged) jenkins-bot: Add Prometheus lua script for nginx-extras [puppet/nginx] - https://gerrit.wikimedia.org/r/372543 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[09:08:10] (PS7) Jcrespo: [WIP]mariadb: First attempt at a mydumper-based dump script [puppet] - https://gerrit.wikimedia.org/r/371944 (https://phabricator.wikimedia.org/T169516)
[09:08:12] (PS1) Jcrespo: dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409)
[09:08:59] (CR) Marostegui: [C: 1] dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[09:10:28] Operations: Upload nodejs 6.x to stretch-wikimedia - https://phabricator.wikimedia.org/T169763#3536834 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff nodejs is available in stretch-wikimedia for a few weeks now (since the last nodejs security release).
[09:10:40] (PS2) Jcrespo: dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409)
[09:11:31] (CR) Jcrespo: [C: 2] dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[09:11:40] (PS3) Jcrespo: dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409)
[09:12:58] (PS5) Filippo Giunchedi: Expose Thumbor Nginx metrics in Prometheus format [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[09:13:11] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1015 [mediawiki-config] - https://gerrit.wikimedia.org/r/372816 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[09:14:44] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1015 [mediawiki-config] - https://gerrit.wikimedia.org/r/372816 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[09:15:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1015 - T173570 (duration: 00m 44s)
[09:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:04] T173570: Decommission db1015 - https://phabricator.wikimedia.org/T173570
[09:16:07] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1015 [mediawiki-config] - https://gerrit.wikimedia.org/r/372816 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[09:17:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1015 - T173570 (duration: 00m 44s)
[09:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:08] (PS6) Filippo Giunchedi: Expose Thumbor Nginx metrics in Prometheus format [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554)
(owner: Gilles)
[09:19:44] (CR) Filippo Giunchedi: [C: 2] Expose Thumbor Nginx metrics in Prometheus format [puppet] - https://gerrit.wikimedia.org/r/372254 (https://phabricator.wikimedia.org/T151554) (owner: Gilles)
[09:28:56] (PS4) Jcrespo: dbstore2001: Increase buffer pool of s7 [puppet] - https://gerrit.wikimedia.org/r/372819 (https://phabricator.wikimedia.org/T168409)
[09:29:22] !log upgrading hhvm-luasandbox on mw1262-1265
[09:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:15] !log restart dbstore2001 mariadb@x1
[09:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:00] (PS1) Filippo Giunchedi: hieradata: put thumbor hosts in cluster thumbor [puppet] - https://gerrit.wikimedia.org/r/372821 (https://phabricator.wikimedia.org/T151554)
[09:50:53] !log restart dbstore2001 mariadb@s1
[09:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:24] (PS2) Filippo Giunchedi: hieradata: put thumbor hosts in cluster thumbor [puppet] - https://gerrit.wikimedia.org/r/372821 (https://phabricator.wikimedia.org/T151554)
[10:00:01] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3536898 (fgiunchedi) a:fgiunchedi>None Unassigned from me since the deployment part is pending
[10:02:50] (CR) Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7553/" [puppet] - https://gerrit.wikimedia.org/r/372821 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[10:02:54] (CR) Filippo Giunchedi: [C: 2] hieradata: put thumbor hosts in cluster thumbor [puppet] - https://gerrit.wikimedia.org/r/372821 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[10:07:18] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package -
https://phabricator.wikimedia.org/T171166#3536925 (MoritzMuehlenhoff) The mwdebug* servers, the canary application servers (mw1261-mw1265) and the canary API servers (mw1276-mw1279) have been upgraded to 2.0.13. Lo...
[10:13:05] (PS1) Filippo Giunchedi: prometheus: poll nginx metrics from thumbor hosts [puppet] - https://gerrit.wikimedia.org/r/372823 (https://phabricator.wikimedia.org/T151554)
[10:15:06] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[10:19:14] godog: Error: Could not find any hostgroup matching 'thumbor_codfw' (config file '/etc/icinga/puppet_hosts.cfg', starting on line 36146)
[10:19:17] ^^^
[10:20:46] ugh, mhh I'll try forcing puppet on thumbor and the install machines
[10:20:53] I'm assuming missing exported resources
[10:21:20] yeah, could be the usual race condition
[10:23:36] (PS1) Urbanecm: Update WMF's address [mediawiki-config] - https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684)
[10:23:48] (PS2) Filippo Giunchedi: prometheus: poll nginx metrics from thumbor hosts [puppet] - https://gerrit.wikimedia.org/r/372823 (https://phabricator.wikimedia.org/T151554)
[10:26:22] job queue size has been growing for a while, is it normal?
https://grafana.wikimedia.org/dashboard/db/job-queue-health
[10:30:52] (CR) Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7555/" [puppet] - https://gerrit.wikimedia.org/r/372823 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[10:30:56] (CR) Filippo Giunchedi: [C: 2] prometheus: poll nginx metrics from thumbor hosts [puppet] - https://gerrit.wikimedia.org/r/372823 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[10:34:19] jouncebot, next
[10:34:19] In 2 hour(s) and 25 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1300)
[10:39:42] Operations, monitoring, netops, User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698#3537022 (fgiunchedi)
[10:48:22] (PS1) Filippo Giunchedi: hieradata: add thumbor hostgroups [puppet] - https://gerrit.wikimedia.org/r/372825 (https://phabricator.wikimedia.org/T151554)
[10:48:39] volans: FWIW it wasn't a race, ^ was missing
[10:50:01] (CR) Filippo Giunchedi: [C: 2] hieradata: add thumbor hostgroups [puppet] - https://gerrit.wikimedia.org/r/372825 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[10:55:07] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[10:57:55] (PS1) Filippo Giunchedi: role: explicit cluster label for thumbor [puppet] - https://gerrit.wikimedia.org/r/372826 (https://phabricator.wikimedia.org/T151554)
[10:59:53] (CR) Filippo Giunchedi: [C: 2] role: explicit cluster label for thumbor [puppet] - https://gerrit.wikimedia.org/r/372826 (https://phabricator.wikimedia.org/T151554) (owner: Filippo Giunchedi)
[11:03:56] godog: ack, right
[11:04:00] thanks for fixing
[11:10:17] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one
unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[11:34:27] (PS18) MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - https://gerrit.wikimedia.org/r/371926
[11:38:56] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:39:07] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra]
[11:39:26] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:39:36] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:39:36] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[11:41:22] (PS10) MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013)
[11:41:54] (PS3) MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042)
[11:42:28] (PS2) MarcoAurelio: Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365)
[12:17:27] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[12:22:09] Amir1: not normal no (queue job size) mind opening a task?
[12:22:30] my first suspect would be sth related to T171371
[12:22:31] T171371: Investigate 30x increase in Jobrunner errors - https://phabricator.wikimedia.org/T171371
[12:23:14] okay
[12:26:03] (PS3) Marostegui: mariadb: Remove db1015 [puppet] - https://gerrit.wikimedia.org/r/372818 (https://phabricator.wikimedia.org/T173570)
[12:27:46] (CR) Marostegui: [C: 2] mariadb: Remove db1015 [puppet] - https://gerrit.wikimedia.org/r/372818 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[12:27:51] Operations, MediaWiki-JobQueue: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537269 (Ladsgroup)
[12:27:58] godog: https://phabricator.wikimedia.org/T173710
[12:33:11] Operations, Performance-Team, Thumbor, Patch-For-Review, User-fgiunchedi: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#3537301 (fgiunchedi) Patches are merged and stats are being polled by prometheus in codfw and eqiad, I've added basic re...
[12:33:40] Amir1: nice, thanks!
[12:34:23] (CR) Marostegui: [C: 2] s3.hosts: Remove db1015 [software] - https://gerrit.wikimedia.org/r/372817 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[12:34:35] Operations, MediaWiki-JobQueue: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537302 (fgiunchedi) cc @aaron and @Krinkle in case this behaviour rings a bell with the work that was done in {T171371} around the same time the increase started
[12:35:09] (Merged) jenkins-bot: s3.hosts: Remove db1015 [software] - https://gerrit.wikimedia.org/r/372817 (https://phabricator.wikimedia.org/T173570) (owner: Marostegui)
[12:35:48] !log Stop MySQL on db1015 to decommission it - T173570
[12:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:03] T173570: Decommission db1015 - https://phabricator.wikimedia.org/T173570
[12:38:46] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3537309 (Marostegui)
[12:39:21] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3533347 (Marostegui) a:Cmjohnson This host is now ready for the remaining steps from @Cmjohnson
[12:40:32] (CR) Filippo Giunchedi: prometheus: new instance 'services' (2 comments) [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[12:40:37] (PS8) Filippo Giunchedi: prometheus: new instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[12:43:32] (PS1) Muehlenhoff: Update docs [debs/debdeploy] - https://gerrit.wikimedia.org/r/372829
[12:44:36] (CR) Muehlenhoff: [C: 2] Update docs [debs/debdeploy] - https://gerrit.wikimedia.org/r/372829 (owner: Muehlenhoff)
[12:47:31] (PS1) Marostegui: db-eqiad.php: Depool db1079
[mediawiki-config] - https://gerrit.wikimedia.org/r/372830 (https://phabricator.wikimedia.org/T163190)
[12:47:43] (PS9) Filippo Giunchedi: prometheus: new instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490)
[12:49:42] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/372830 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[12:51:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/372830 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[12:51:26] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/372830 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[12:52:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 00m 45s)
[12:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:27] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[12:52:46] !log Stop replication on db1079 and db1041 to compare their data - T163190
[12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:02] jouncebot: next
[12:58:02] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1300)
[12:59:25] CI not busy, so far so good
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1300).
[13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:13] I'm here
[13:00:15] I can SWAT today!
[13:00:18] That's great!
[13:00:53] I'll start with patches, as listed, anything special about any of them, or all standard?
[13:01:13] all of them can be tested at mwdebug?
[13:01:57] Urbanecm: ^
[13:02:11] All of them are testable
[13:02:32] There's nothing special about them, no scrips should be required
[13:02:41] zeljkof, ^
[13:02:54] great, thanks
[13:02:56] ye
[13:02:58] yw
[13:03:49] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471) (owner: Urbanecm)
[13:04:04] (CR) Filippo Giunchedi: [C: 2] prometheus: new instance 'services' [puppet] - https://gerrit.wikimedia.org/r/372357 (https://phabricator.wikimedia.org/T173490) (owner: Filippo Giunchedi)
[13:05:14] (Merged) jenkins-bot: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471) (owner: Urbanecm)
[13:06:12] (CR) jenkins-bot: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471) (owner: Urbanecm)
[13:07:36] Urbanecm: 372212 at mwdebug1002, let me know if I can push forward
[13:07:49] ack
[13:09:44] zeljkof, please deploy
[13:09:51] Urbanecm: ok
[13:10:45] !log zfilipin@tin Synchronized dblists/closed.dblist: SWAT: [[gerrit:372212|Reopen bawikibooks (T173471)]] (duration: 00m 44s)
[13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:57] T173471: Re-open Wikibooks Bashkir - https://phabricator.wikimedia.org/T173471
[13:11:38] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372212|Reopen bawikibooks (T173471)]] (duration: 00m 44s)
[13:11:47] Urbanecm: deployed, please check
[13:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:29] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] -
https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471) (owner: Odder)
[13:14:04] Hmm. There seems to be an error. I can create new account but not using centralauth.
[13:14:29] But it seems we can solve it after SWAT so revert isn't needed
[13:14:40] Urbanecm: ok, that's what I wanted to ask
[13:14:50] maybe something needs to be configured additionally
[13:15:06] Yep, I'll look after it after SWAT.
[13:15:34] Urbanecm: argh, 372777 has merge conflict
[13:15:39] (PS3) Zfilipin: Add new logo for the Baskhir Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471) (owner: Odder)
[13:15:50] ok, looks like rebase fixed it
[13:15:57] Great
[13:17:34] (CR) jenkins-bot: Add new logo for the Baskhir Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471) (owner: Odder)
[13:18:26] zeljkof, BTW, does logstash show something about the account creation issue?
[13:18:42] Urbanecm: did not look
[13:18:59] where would I see that?
[13:19:36] Urbanecm: 372777 at mwdebug1002
[13:20:23] zeljkof, logstash is at logstash.wikimedia.org. It is restricted but deployers should have access. I don't know where it could be in it.
[13:20:24] ack
[13:20:52] 372777 is working, please deploy
[13:22:16] zeljkof, seems I was finally able to log in using my normal user, so everything is working correctly.
[13:25:39] Urbanecm: I can take a look after swat
[13:25:42] deploying
[13:25:50] zeljkof, not needed, it is working already :)
[13:26:10] Urbanecm: even better
[13:26:11] :)
[13:26:23] !log adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on dbstore1002: T170990
[13:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:39] T170990: Add index to mediawiki_page_create_1 table - https://phabricator.wikimedia.org/T170990
[13:26:42] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:372777|Add new logo for the Baskhir Wikibooks (T173471)]] (duration: 00m 44s)
[13:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:53] T173471: Re-open Wikibooks Bashkir - https://phabricator.wikimedia.org/T173471
[13:27:37] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372777|Add new logo for the Baskhir Wikibooks (T173471)]] (duration: 00m 44s)
[13:27:38] !log stop dbstore2001 mariadb@s7
[13:27:46] Urbanecm: 372777 deployed, please check
[13:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:02] working, thank you
[13:29:08] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372795 (https://phabricator.wikimedia.org/T172245) (owner: Urbanecm)
[13:31:17] gerrit is super annoying
[13:31:21] (CR) Ottomata: [C: 1] "One comment about requiring python-kafka, otherwise +1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[13:31:45] second patch in a row it "cannot merge" and all I need to do is to click rebase and then it works :|
[13:31:50] (PS2) Zfilipin: Update logos for srwiktionary, add HD logos for srwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372795
(https://phabricator.wikimedia.org/T172245) (owner: Urbanecm)
[13:32:32] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3537706 (faidon) >>! In T172681#3526998, @elukey wrote: > 1) Make sure to install/deploy pmacct 1.6.2 (follow up wi Deb...
[13:33:41] (CR) jenkins-bot: Update logos for srwiktionary, add HD logos for srwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372795 (https://phabricator.wikimedia.org/T172245) (owner: Urbanecm)
[13:34:48] Urbanecm: 372795 at mwdebug
[13:35:02] ack
[13:35:46] working, deploy
[13:36:06] ok
[13:36:57] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:372795|Update logos for srwiktionary, add HD logos for srwiktionary (T172245)]] (duration: 00m 45s)
[13:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:10] T172245: Logo for sr.wiktionary.org - https://phabricator.wikimedia.org/T172245
[13:37:47] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372795|Update logos for srwiktionary, add HD logos for srwiktionary (T172245)]] (duration: 00m 44s)
[13:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:04] Urbanecm: deployed, please check
[13:38:25] working
[13:38:27] thank you
[13:39:17] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631) (owner: Gerrit Patch Uploader)
[13:39:29] (PS2) Zfilipin: Set X-Frame-Options: SAMEORIGIN if UploadWizard enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631) (owner: Gerrit Patch Uploader)
[13:42:00] (PS2) Zfilipin: Update logos for srwikinews, add HD version for them [mediawiki-config] - https://gerrit.wikimedia.org/r/372797
(https://phabricator.wikimedia.org/T172255) (owner: Urbanecm)
[13:42:02] (CR) jenkins-bot: Set X-Frame-Options: SAMEORIGIN if UploadWizard enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631) (owner: Gerrit Patch Uploader)
[13:43:01] (Abandoned) Urbanecm: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/368581 (owner: Urbanecm)
[13:43:03] Urbanecm: 372789 is at mwdebug, this one looks harder to test
[13:45:02] It is possible to test relatively easily, but I must be autoconfirmed. So please push to prod w/o test, I'll ask for it at the task.
[13:45:18] Urbanecm: ok
[13:46:18] !log adding index on (database, rev_timestamp) on mediawiki_page_create_2 table on db1047: T170990
[13:46:28] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372789|Set X-Frame-Options: SAMEORIGIN if UploadWizard enabled (T173631)]] (duration: 00m 44s)
[13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:29] T170990: Add index to mediawiki_page_create_1 table - https://phabricator.wikimedia.org/T170990
[13:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:41] T173631: File Upload Wizard doesn't work well with X-Frame-Options set to be DENY on zhwiki - https://phabricator.wikimedia.org/T173631
[13:46:49] Urbanecm: deployed, but you can not test, right?
[13:47:09] Yes.
I'll ask for review at the task, they definitely will be able to test :D [13:47:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372797 (https://phabricator.wikimedia.org/T172255) (owner: 10Urbanecm) [13:48:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add /data/ Redirect for commons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/360887 (https://phabricator.wikimedia.org/T163922) (owner: 10Ladsgroup) [13:49:25] (03Merged) 10jenkins-bot: Update logos for srwikinews, add HD version for them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372797 (https://phabricator.wikimedia.org/T172255) (owner: 10Urbanecm) [13:50:30] Urbanecm: 372797 at mwdebug [13:50:33] ack [13:51:09] working, deploy please [13:51:51] (03PS3) 10Urbanecm: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) [13:52:00] (03PS2) 10Urbanecm: Add HD logos for srwikisource, update them too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372796 (https://phabricator.wikimedia.org/T172268) [13:52:04] (03PS3) 10Urbanecm: Initial configuration for hifwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) [13:52:06] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:372797|Update logos for srwikinews, add HD version for them (T172255)]] (duration: 00m 44s) [13:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:18] T172255: Logo for sr.wikinews.org - https://phabricator.wikimedia.org/T172255 [13:52:44] !log stop dbstore2001 mariadb@s4, start mariadb@s7 [13:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:05] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372797|Update logos for srwikinews, add HD version for them (T172255)]] (duration: 00m 49s) [13:53:15] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:36] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3537789 (10akosiaris) I think we have been unblocked btw ``` $ curl -i --resolve en.wikipedia.org:443:208.... [13:53:45] Urbanecm: 372797 deployed, please check [13:53:49] ack [13:54:24] working [13:57:05] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372796 (https://phabricator.wikimedia.org/T172268) (owner: 10Urbanecm) [13:58:25] (03CR) 10Urbanecm: [C: 04-1] Set project logo for wikimania2018wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [13:58:28] (03Merged) 10jenkins-bot: Add HD logos for srwikisource, update them too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372796 (https://phabricator.wikimedia.org/T172268) (owner: 10Urbanecm) [14:00:18] Urbanecm: 372796 at mwdebug [14:00:22] ack [14:01:02] Working, please deploy! 
[14:01:39] (03CR) 10Urbanecm: [C: 031] Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) (owner: 10MarcoAurelio) [14:02:08] Urbanecm: ok [14:02:56] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:372796|Add HD logos for srwikisource, update them too (T172268)]] (duration: 00m 44s) [14:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:12] T172268: Logo for sr.wikisource.org - https://phabricator.wikimedia.org/T172268 [14:03:15] (03CR) 10jenkins-bot: Update logos for srwikinews, add HD version for them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372797 (https://phabricator.wikimedia.org/T172255) (owner: 10Urbanecm) [14:03:17] (03CR) 10jenkins-bot: Add HD logos for srwikisource, update them too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372796 (https://phabricator.wikimedia.org/T172268) (owner: 10Urbanecm) [14:04:01] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372796|Add HD logos for srwikisource, update them too (T172268)]] (duration: 00m 44s) [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:12] Urbanecm: deployed, please check [14:04:16] ack [14:04:38] Working, thank you for your deploys! 
[14:04:51] !log EU SWAT finished [14:05:01] Urbanecm: thanks for releasing with #releng ;) [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:30] 10Operations, 10media-storage, 10User-fgiunchedi: Track down the source of periodic increases in requests to swift eqiad - https://phabricator.wikimedia.org/T173721#3537809 (10fgiunchedi) [14:07:21] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3537851 (10fgiunchedi) FYI the periodic increase in swift requests is now tracked separately at... [14:26:38] (03PS1) 10Jdlrobson: Hygiene: Remove dead config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372838 [14:30:19] (03CR) 10Jforrester: [C: 04-2] "Shouldn't be merged until after 2017-09-23." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [14:32:59] (03CR) 10Urbanecm: [C: 031] Hygiene: Remove dead config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372838 (owner: 10Jdlrobson) [14:42:26] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3537913 (10Gehel) [14:47:45] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372840 [14:47:50] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372840 [14:48:09] (03PS1) 10Muehlenhoff: Add new config options to debdeploy.conf to support pre-defined library name mapping for the restart check [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/372841 [14:49:18] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to 
Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527818 (10BBlack) You can see a view of cache_upload's over all 2xx (and everything else) here:... [14:49:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372840 (owner: 10Marostegui) [14:50:23] (03CR) 10Muehlenhoff: [C: 032] Add new config options to debdeploy.conf to support pre-defined library name mapping for the restart check [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/372841 (owner: 10Muehlenhoff) [14:50:32] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:51:07] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372840 (owner: 10Marostegui) [14:52:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 00m 44s) [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [14:53:46] !log mobrovac@tin Started deploy [cxserver/deploy@1065ffe]: Deploy 1065ffe2 to canary scb2001 for debugging - T173038 [14:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] T173038: /v1/translate/{from}/{to}{/provider} endpoint fails while deploying cxserver - https://phabricator.wikimedia.org/T173038 [14:54:04] !log mobrovac@tin Finished deploy [cxserver/deploy@1065ffe]: Deploy 1065ffe2 to canary scb2001 for debugging - T173038 (duration: 00m 18s) [14:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:05] !log cxserver depool scb2001 to debug failed checks - T173038 [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:17] 10Operations, 10ops-codfw: mw2256 - hardware issue - 
https://phabricator.wikimedia.org/T163346#3537968 (10Papaul) Extended test came out with not HW problem. I will be contacting Dell once again to follow up on the case. {F9142196} [14:57:02] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) is CRITICAL: Could not fetch url http://10.192.32.132:8080/v1/translate/en/es/Apertium: Generic connection error: HTTPConnectionPool(host=u10.192.32.132, port=8080): Max retries exceeded with url: /v1/translate/en/es/Apertium (Caused by Pro [14:57:02] on aborted., BadStatusLine(,))) [14:58:09] (03CR) 10Ottomata: "One comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [14:59:43] known ^ [15:01:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372840 (owner: 10Marostegui) [15:02:52] 10Operations, 10monitoring, 10User-fgiunchedi: Update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3537974 (10faidon) 05stalled>03Resolved Fixed for our purposes, we can follow-up on upstream's/Debian's bug reports for the long-term fixes. [15:07:42] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698#3538001 (10faidon) p:05Normal>03Low [15:07:47] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3538002 (10Papaul) @madhuvishy here is my proposal: labstore2001 2 shelves labstore-array [0-1] labstore2002 3 shelves la... 
[15:13:39] 10Operations, 10monitoring: Monitor internal CA expirations - https://phabricator.wikimedia.org/T171157#3538015 (10faidon) 05Open>03stalled Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options. [15:14:10] 10Operations, 10monitoring: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3538017 (10faidon) a:05MoritzMuehlenhoff>03Dzahn [15:16:42] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:18:42] RECOVERY - HHVM jobrunner on mw1165 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [15:18:43] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:20:07] 10Operations, 10Operations-Software-Development, 10monitoring: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#3538038 (10faidon) a:05Volans>03fgiunchedi [15:21:22] 10Operations, 10Operations-Software-Development, 10monitoring, 10User-fgiunchedi: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#3538042 (10fgiunchedi) [15:36:22] 10Operations, 10media-storage: Reduce swift frontend conntrack usage - https://phabricator.wikimedia.org/T173731#3538120 (10fgiunchedi) [15:36:28] herron: ^ [15:37:19] (03PS4) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [15:37:58] (03CR) 10jerkins-bot: [V: 04-1] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:39:02] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2030051 [15:39:20] godog thx! 
[15:39:31] (03PS1) 10Filippo Giunchedi: role: collect from restbase test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/372845 (https://phabricator.wikimedia.org/T173490) [15:40:58] !log cp1099 - varnish backend restart for mailbox lag [15:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] 10Operations, 10Mail, 10OTRS: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3538151 (10akosiaris) [15:44:39] (03CR) 10Volans: "Logic looks good, seems that there is some leftover in the tests part and I've added few minor comments inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:49:02] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [15:52:13] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Get translations for "IE8 on XP won't work" - https://phabricator.wikimedia.org/T172418#3538180 (10BBlack) I see we have a few new translations up today, I'll incorporate them shortly! :) [15:57:31] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3538225 (10Halfak) Ran another test. Looks like we can handle 2000 requests per minute without much trouble. But we barf at 3000 requests per... [15:58:48] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3538255 (10Halfak) So my thought is that we can certainly bump up the number or workers. I also think we should increase the max size for the... [16:00:34] 10Operations, 10media-storage: Reduce swift frontend conntrack usage - https://phabricator.wikimedia.org/T173731#3538263 (10fgiunchedi) Note that statsd and swift account for the majority of entries in conntrack. statsd needs to be explicitly excluded. 
For swift traffic the "direction" in the ferm rules nee... [16:06:44] 10Operations, 10Mail, 10OTRS: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3538311 (10akosiaris) p:05Triage>03Normal [16:10:21] (03PS1) 10Alexandros Kosiaris: mail::mx: Ship bounce/warn message files [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) [16:15:42] (03PS2) 10Jforrester: Enable OOjs UI EditPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 [16:15:48] (03CR) 10Jforrester: [C: 031] "Now good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 (owner: 10Jforrester) [16:34:07] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3538495 (10Johan) >>! In T163251#3531895, @Johan wrote: >... [16:35:43] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2103975 [16:36:05] (03CR) 10Ottomata: [C: 031] webperf: Add unit tests for schema handlers and stat dispatching [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [16:36:13] (03CR) 10Ottomata: [C: 031] "Shall I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [16:40:46] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3538536 (10BBlack) Heh yeah I guess you're right. Still,... 
[16:43:35] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3538561 (10Halfak) a:05Halfak>03Ladsgroup [16:44:54] 10Operations, 10media-storage, 10User-fgiunchedi: Reduce swift frontend conntrack usage - https://phabricator.wikimedia.org/T173731#3538574 (10fgiunchedi) [16:46:30] 10Operations, 10ORES, 10Scoring-platform-team-Backlog: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3538590 (10Halfak) [16:49:37] (03PS1) 10Alexandros Kosiaris: Introduce ganeti100[789].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/372857 (https://phabricator.wikimedia.org/T173565) [16:53:36] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3538657 (10madhuvishy) @Papaul Yeah that seems fine to me. Thanks! [16:57:33] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Get translations for "IE8 on XP won't work" - https://phabricator.wikimedia.org/T172418#3538680 (10Johan) 05Open>03Resolved There's now also a Russian translation, which means we've got the very most basic ones (major languages... [16:57:35] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3538682 (10Johan) [17:00:04] gehel: Respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1700). Please do the needful. [17:01:22] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:05:25] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3538821 (10GWicke) @ovasileva @phuedx, could you update this task with your current estimate for OCG's sunsetting? [17:05:32] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3509813 (10RobH) This access request was reviewed and approved in today's (2017-08-21) operations team meeting. Giving these admins the r... [17:07:12] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team (Watching / External), 10User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3538839 (10RobH) The request to add Daniel to mw deployers was approved in today's Operations team meeting. [17:09:15] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3538863 (10RobH) The request to give addshore contint-admins was approved in today's operations t... 
[17:10:22] !log gehel@tin Started deploy [wdqs/wdqs@a1c4f1f]: (no justification provided) [17:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:09] (03CR) 10MarcoAurelio: Set project logo for wikimania2018wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [17:12:14] (03PS4) 10MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) [17:12:50] !log gehel@tin Finished deploy [wdqs/wdqs@a1c4f1f]: (no justification provided) (duration: 02m 27s) [17:12:59] 10Operations, 10Ops-Access-Requests: Requesting @ops in #wikimedia-tech for Luke081515 - https://phabricator.wikimedia.org/T172793#3538884 (10RobH) 05Open>03Resolved approved and done [17:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:05] SMalyshev: ^ deployment completed, tests are green... [17:13:19] (03PS5) 10MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) [17:18:03] (03CR) 10MarcoAurelio: [C: 04-1] "This is not right. It now removes valid stuff." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [17:18:03] gehel: tjanks! 
[17:18:07] *thanks [17:18:34] (03PS6) 10MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) [17:21:37] (03PS7) 10MarcoAurelio: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) [17:22:18] !log mobrovac@tin Started deploy [cxserver/deploy@f43ef96]: Bring back cxserver on scb2001 to a stable state - T173038 [17:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:31] T173038: /v1/translate/{from}/{to}{/provider} endpoint fails while deploying cxserver - https://phabricator.wikimedia.org/T173038 [17:22:32] !log mobrovac@tin Finished deploy [cxserver/deploy@f43ef96]: Bring back cxserver on scb2001 to a stable state - T173038 (duration: 00m 14s) [17:22:42] (03PS2) 10MarcoAurelio: Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) [17:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:59] (03CR) 10Volans: [C: 031] "Great!" [puppet] - 10https://gerrit.wikimedia.org/r/370993 (https://phabricator.wikimedia.org/T164780) (owner: 10Jcrespo) [17:23:52] (03PS2) 10Jforrester: Remove setting no longer in MediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366869 [17:24:27] (03PS5) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [17:25:37] (03PS6) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [17:26:12] (03CR) 10jerkins-bot: [V: 04-1] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [17:26:39] bblack: are you around? 
[17:27:35] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3527818 (10Jdlrobson) Related: T173434 [17:28:04] (03PS3) 10Jdlrobson: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672) [17:28:12] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [17:29:30] (03PS1) 10Ottomata: Apply base::firewall and standard to druid100[456] [puppet] - 10https://gerrit.wikimedia.org/r/372859 (https://phabricator.wikimedia.org/T171626) [17:29:33] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:30:09] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3538943 (10Jdlrobson) @bblack do you have any qualms about us continuing our roll out in T162672... 
[17:31:26] (03CR) 10Ottomata: [C: 032] Apply base::firewall and standard to druid100[456] [puppet] - 10https://gerrit.wikimedia.org/r/372859 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [17:33:42] (03PS7) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [17:34:30] (03CR) 10jerkins-bot: [V: 04-1] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [17:35:22] jdlrobson: yes :) [17:37:31] bblack: hey so i wrote my question down here > https://phabricator.wikimedia.org/T173422#3538943 [17:38:12] my read of the situation is we're seeing expected behaviour for such a release, but it doesn't sound like we should worry about turning on the last few wikis (all the big ones were in the 2nd roll out) [17:38:14] is that fair? [17:39:07] at least from cache_upload's perspective, it's not melting anything, so yeah [17:39:24] godog: maybe could comment if it's likely to cause a swift-layer problem (but doesn't sound like it?) [17:40:00] bblack: yup I'll comment on the task but LGTM from where I'm standing cc jdlrobson [17:40:41] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3538950 (10BBlack) No qualms on the cache end of things! 
[17:40:45] 10Operations, 10Page-Previews, 10Traffic, 10Readers-Web-Backlog (Tracking): Investigate the increase in the number of requests to Swift after the Page Previews deploy - https://phabricator.wikimedia.org/T173422#3538951 (10fgiunchedi) I'm +1 on the swift side to resume rollout everywhere but en/de [17:44:44] 10Operations, 10Electron-PDFs, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3538954 (10GWicke) The Electron render service currently requires manual attention every few days, so w... [17:48:22] (03PS2) 10Dzahn: admins: add additional admin addshore to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/372211 (https://phabricator.wikimedia.org/T173233) [17:50:34] (03CR) 10Dzahn: [C: 032] "approved in today's ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/372211 (https://phabricator.wikimedia.org/T173233) (owner: 10Dzahn) [17:50:45] (03PS8) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [17:51:45] (03CR) 10Gehel: logrotate - introduce a generic logrotate template (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [17:53:04] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3538961 (10Papaul) @madhuvishy Let me know when i have green light to disconnect everything and start working on the new s... 
[17:53:39] (03PS1) 10Ottomata: Conditionally include zookeeper server in druid worker role [puppet] - 10https://gerrit.wikimedia.org/r/372862 (https://phabricator.wikimedia.org/T171626) [17:53:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3538963 (10madhuvishy) @Papaul The servers are not in use and have no useful data in them, you have green light to disconn... [17:55:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3471092 (10Cmjohnson) @papaul please let me know and @madhuvishy know if you have any issues getting the disk shelves to b... [17:57:15] bblack, hey [17:58:25] (03CR) 10Ottomata: "No op https://puppet-compiler.wmflabs.org/compiler02/7556/" [puppet] - 10https://gerrit.wikimedia.org/r/372862 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [17:58:27] (03CR) 10Ottomata: [C: 032] Conditionally include zookeeper server in druid worker role [puppet] - 10https://gerrit.wikimedia.org/r/372862 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [17:58:32] (03PS2) 10Ottomata: Conditionally include zookeeper server in druid worker role [puppet] - 10https://gerrit.wikimedia.org/r/372862 (https://phabricator.wikimedia.org/T171626) [17:58:34] (03CR) 10Ottomata: [V: 032 C: 032] Conditionally include zookeeper server in druid worker role [puppet] - 10https://gerrit.wikimedia.org/r/372862 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [17:58:53] Krenair: hi :) [17:59:30] I kept trying to talk to you on IRC but I guess I kept getting unlucky with timezones etc [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT 
(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T1800). Please do the needful. [18:00:05] TabbyCat, Krenair, James_F, and Jdlrobson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:34] bblack, did faidon mention I was thinking about https://gerrit.wikimedia.org/r/#/c/317450/ ? [18:00:48] o/ [18:01:47] (Hey.) [18:02:02] I can SWAT. [18:03:12] (03CR) 10Niharika29: [C: 032] Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [18:04:44] (03Merged) 10jenkins-bot: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [18:06:08] (03CR) 10jenkins-bot: Set project logo for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371135 (https://phabricator.wikimedia.org/T173042) (owner: 10MarcoAurelio) [18:06:19] TabbyCat: I don't think you can test https://gerrit.wikimedia.org/r/#/c/371135/ on mwdebug1002, can you? [18:06:29] Krenair: I heard through the grapevine :) I haven't had time to stare at it, but in general it's on our radar to get this done one way or another by the end of the calendar year. [18:06:40] Niharika: I think I can, allow me to please [18:06:42] I'll try to make some time this week! [18:06:59] TabbyCat: Go on, it's there. [18:07:31] (03CR) 10Niharika29: [C: 032] Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365) (owner: 10MarcoAurelio) [18:07:34] bblack, okay. 
right now my main interest is ensuring that what I already have is fundamentally what we want, then I should turn that "list of stuff this commit includes" into "list of stuff this commit doesn't include" [18:07:51] (03CR) 10Niharika29: [C: 032] Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: 10MarcoAurelio) [18:07:59] Krenair: ok, sounds great :) [18:08:05] (03PS1) 10Ottomata: Set up druid on druid100[456] [puppet] - 10https://gerrit.wikimedia.org/r/372863 (https://phabricator.wikimedia.org/T171626) [18:08:30] (03PS2) 10Ottomata: Set up druid on druid100[456] [puppet] - 10https://gerrit.wikimedia.org/r/372863 (https://phabricator.wikimedia.org/T171626) [18:08:38] Niharika: lgtm [18:08:54] 10Operations, 10Electron-PDFs, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3539023 (10faidon) I don't really mind who owns the service (Services or Readers), as long as it's owne... [18:09:07] TabbyCat: Alrighty, going live then. [18:09:21] (03PS3) 10Niharika29: Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365) (owner: 10MarcoAurelio) [18:09:52] (03CR) 10Ottomata: [C: 032] Set up druid on druid100[456] [puppet] - 10https://gerrit.wikimedia.org/r/372863 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:09:53] (im here when needed) [18:10:10] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, 10Research-collaborations: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3539028 (10RobH) a:05RobH>03herron So @herron is on clinic duty this week, and expressed an interest in taking care... 
[18:10:48] (03CR) 10Niharika29: [C: 032] Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365) (owner: 10MarcoAurelio) [18:11:11] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Set project logo for wikimania2018wiki T173042 (duration: 00m 44s) [18:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:24] T173042: Wikimania 2018 wiki site icon - https://phabricator.wikimedia.org/T173042 [18:12:20] (03Merged) 10jenkins-bot: Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365) (owner: 10MarcoAurelio) [18:12:30] !log niharika29@tin Synchronized static/images/: Set project logo for wikimania2018wiki T173042 (duration: 00m 44s) [18:12:33] TabbyCat: And it's live. [18:12:34] (03CR) 10jenkins-bot: Administrators to add/remove 'transwiki' at nowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372045 (https://phabricator.wikimedia.org/T172365) (owner: 10MarcoAurelio) [18:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:56] Niharika: retested and looks good on live, thanks [18:13:32] TabbyCat: https://gerrit.wikimedia.org/r/#/c/372045/ is on mwdebug1002 [18:13:38] (03PS3) 10Niharika29: Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: 10MarcoAurelio) [18:13:48] (03CR) 10Niharika29: [C: 032] Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: 10MarcoAurelio) [18:14:07] Niharika: ack checking [18:14:27] (03PS3) 10Niharika29: Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:14:41] (03CR) 10Niharika29: [C: 032] Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:15:06] Niharika: lgtm [18:15:21] (03Merged) 10jenkins-bot: Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: 10MarcoAurelio) [18:15:31] (03CR) 10jenkins-bot: Added Cookbook and Cookbook talk NS on hi.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372387 (https://phabricator.wikimedia.org/T173398) (owner: 10MarcoAurelio) [18:15:47] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3539046 (10Dzahn) 05Open>03Resolved a:03Dzahn Hi @addshore you have been added to the group... [18:16:35] (03PS4) 10Niharika29: Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:16:39] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Administrators to add/remove 'transwiki' at nowiktionary T172365 (duration: 00m 45s) [18:16:44] too many pingz [18:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] TabbyCat: Synced! 
[18:16:50] T172365: Rights to assign transwiki group on nowiktionary - https://phabricator.wikimedia.org/T172365 [18:17:04] another happy customer :) [18:17:26] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3539052 (10Dzahn) And here are the things you can do as root: ``` [contint1001:~] $ sudo cat /e... [18:17:29] TabbyCat: And https://gerrit.wikimedia.org/r/#/c/372387/ is on mwdebug1002 as well. [18:17:51] Jenkins is giving me merge conflicts even after rebasing. [18:18:02] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:18:10] (03PS1) 10Ottomata: Make sure profile::cdh::apt happens before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/372864 (https://phabricator.wikimedia.org/T171626) [18:18:12] PROBLEM - puppet last run on druid1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hadoop-client] [18:18:13] ^ me applying stuff [18:18:14] Niharika: I can try testing that but I cannot read Hindi... all letters seem the same to me [18:18:14] sorry [18:18:18] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:18:51] TabbyCat: Just check if nothing's broken? [18:19:32] Niharika: nothing broke then [18:19:36] Krenair: You around to test yours? 
[18:19:39] on mwdebug1002 [18:19:45] (03Merged) 10jenkins-bot: Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:19:45] sure [18:19:49] mine is beta-only, so [18:19:51] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/7557/druid1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/372864 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:19:54] (03CR) 10Ottomata: [C: 032] Make sure profile::cdh::apt happens before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/372864 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:19:56] (03PS2) 10Niharika29: Wikibase on deployment-prep: Exclude non-existent wikis from clientDbList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372761 (https://phabricator.wikimedia.org/T173571) (owner: 10Alex Monk) [18:19:59] (03CR) 10jenkins-bot: Increase AbuseFilter autodisable thresholds for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372768 (https://phabricator.wikimedia.org/T173633) (owner: 10MarcoAurelio) [18:21:45] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372761 (https://phabricator.wikimedia.org/T173571) (owner: 10Alex Monk) [18:22:02] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Add CookBook and Cookbook Talk NS on hiwikibooks T173398 (duration: 00m 45s) [18:22:13] RECOVERY - puppet last run on druid1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:16] T173398: Add cookbook namespace to Hindi Wikibooks - https://phabricator.wikimedia.org/T173398 [18:22:40] TabbyCat: And your last patch https://gerrit.wikimedia.org/r/#/c/372768/4 is on mwdebug1002 as well now. 
[18:23:08] (03Merged) 10jenkins-bot: Wikibase on deployment-prep: Exclude non-existent wikis from clientDbList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372761 (https://phabricator.wikimedia.org/T173571) (owner: 10Alex Monk) [18:23:18] (03CR) 10jenkins-bot: Wikibase on deployment-prep: Exclude non-existent wikis from clientDbList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372761 (https://phabricator.wikimedia.org/T173571) (owner: 10Alex Monk) [18:24:06] Niharika: untestable right now [18:24:19] it'd require a filter which is autodisabled and we've got none [18:24:22] TabbyCat: So sync directly? [18:24:30] nothing seems broken on mwdebug1002@meta [18:24:46] Niharika: looks like the only option [18:25:43] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [18:25:49] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Increase AbuseFilter autodisable thresholds for Meta-Wiki T173633 (duration: 00m 44s) [18:25:50] TabbyCat: ....and done! [18:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:00] T173633: Increase AbuseFilter autodisable thresholds on Meta-Wiki - https://phabricator.wikimedia.org/T173633 [18:26:04] Niharika: thanks [18:26:16] Krenair: Yours is on mwdebug1002. [18:26:20] no moar patches from me to swat, giving way [18:26:40] (03PS3) 10Niharika29: Enable OOjs UI EditPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 (owner: 10Jforrester) [18:26:43] PROBLEM - configured eth on sodium is CRITICAL: eth1 reporting no carrier. [18:26:45] Niharika, the patch won't change anything in prod [18:26:58] it just adds a wmfRealm === 'labs' block [18:27:12] Krenair: Alright so you can't test it? Just sync? [18:27:20] there's nothing to test in prod [18:27:24] other than everything remaining the same [18:27:45] Krenair: Can I pull it on a different server for you to test? [18:27:55] why? [18:28:07] You didn't answer "Just sync?" 
[18:28:23] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 (owner: 10Jforrester) [18:28:30] just sync [18:29:49] (03Merged) 10jenkins-bot: Enable OOjs UI EditPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 (owner: 10Jforrester) [18:29:58] (03CR) 10jenkins-bot: Enable OOjs UI EditPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366868 (owner: 10Jforrester) [18:30:01] !log niharika29@tin Synchronized wmf-config/Wikibase.php: Wikibase on deployment-prep: Exclude non-existent wikis from clientDbList T173571 (duration: 00m 44s) [18:30:06] Done. [18:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:15] T173571: Disk full on deployment-jobrunner02 - https://phabricator.wikimedia.org/T173571 [18:30:17] jdlrobson: Are you around? [18:30:29] Yes [18:30:41] Niharika ready for service:) [18:30:52] RECOVERY - configured eth on sodium is OK: OK - interfaces up [18:30:55] jdlrobson: Cool, you're up next in a moment. [18:31:02] James_F: Your patch is on mwdebug1002. [18:31:10] thanks Niharika [18:31:23] You're welcome. [18:31:30] Niharika: LGTM. [18:31:40] (03PS1) 10Ladsgroup: ores: More configs for stress testing [puppet] - 10https://gerrit.wikimedia.org/r/372866 (https://phabricator.wikimedia.org/T169246) [18:32:41] (03PS1) 10Ottomata: Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) [18:33:00] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable OOjs UI EditPage on all wikis https://gerrit.wikimedia.org/r/#/c/366868/ (duration: 00m 44s) [18:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:26] There's the IS. [18:33:30] Niharika: Argh, CommonsSettings first would have been better. 
[18:33:36] (03PS4) 10Niharika29: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672) (owner: 10Jdlrobson) [18:33:47] (03PS2) 10Ottomata: Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) [18:33:51] Niharika: It'll spike the error logs and people will get the wrong experience until you do. [18:33:57] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672) (owner: 10Jdlrobson) [18:34:03] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Enable OOjs UI EditPage on all wikis https://gerrit.wikimedia.org/r/#/c/366868/ (duration: 00m 44s) [18:34:10] Done now. :-) [18:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:16] (03CR) 10jerkins-bot: [V: 04-1] Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:34:25] Ouch. I somehow misremembered them to be the other way around. [18:34:31] My bad. [18:34:51] Such is life. [18:35:18] (03PS3) 10Ottomata: Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) [18:35:23] (03Merged) 10jenkins-bot: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672) (owner: 10Jdlrobson) [18:35:34] James_F: Wait, it is IS and the CS, isn't it? We define the variables in IS and then use them in CS. 
[18:35:44] (03CR) 10jerkins-bot: [V: 04-1] Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:36:00] Niharika: Except in this case we are unsetting in IS and setting a general variable in CS, so you should do CS first. [18:36:03] (03CR) 10jenkins-bot: Roll page previews out to all wikis except en and de wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372160 (https://phabricator.wikimedia.org/T162672) (owner: 10Jdlrobson) [18:36:37] Ah, I see. [18:36:51] jdlrobson: Your change is on mwdebug1002. [18:36:56] Niharika: on it! [18:37:18] \o/ I'm getting better at this. [18:37:24] :-) [18:38:35] Niharika: looks good to me! sync away [18:38:41] jdlrobson: Ack! [18:40:02] !log niharika29@tin Synchronized dblists/pp_stage1.dblist: Roll page previews out to all wikis except en and de wiki T162672 (duration: 00m 44s) [18:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:13] T162672: Deploy page previews to 90% of users on all wikis but English and German - https://phabricator.wikimedia.org/T162672 [18:40:22] And your change is live, jdlrobson. [18:40:38] SWAT over. [18:41:35] Niharika: sure it's live everywhere? [18:41:37] RECOVERY - Host labvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:41:58] jdlrobson: I did sync it. [18:42:02] What's wrong? [18:42:14] might be caching so let me check.. [18:42:28] yeh probably caching. I'll wait 5 mins to double check [18:42:57] Okay. Ping me if you want me to sync it again or something. [18:46:37] Niharika: looks like we are good [18:46:50] Yay! [18:48:17] PROBLEM - Check systemd state on logstash1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:48:27] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.109:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.109, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f210a35ab90: Failed to establish a new connection: [Errno 111] Connection [18:48:53] (03PS4) 10Ottomata: Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) [18:49:06] PROBLEM - Druid broker on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server broker [18:49:33] (03CR) 10Ottomata: [C: 032] Move use_cdh setting into profile hiera param [puppet] - 10https://gerrit.wikimedia.org/r/372867 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:49:58] PROBLEM - Druid coordinator on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server coordinator [18:50:58] PROBLEM - Druid historical on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical [18:51:46] PROBLEM - Druid middlemanager on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server middleManager [18:52:37] PROBLEM - Druid overlord on druid1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server overlord [18:56:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3539262 (10Andrew) Syslog jumps from Aug 15 15:53:01 to Aug 21 18:40:37 with no indication of trouble: Aug 15 15:53:01 labvirt1015 CRON[136739]: (prometheus) CMD (/usr/local... 
[18:57:01] (03PS1) 10Ottomata: Check for druid-hdfs-storage-cdh dir existance before looking for jar link [puppet] - 10https://gerrit.wikimedia.org/r/372872 (https://phabricator.wikimedia.org/T171626) [18:57:40] (03CR) 10Ottomata: [C: 032] Check for druid-hdfs-storage-cdh dir existance before looking for jar link [puppet] - 10https://gerrit.wikimedia.org/r/372872 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [18:58:54] ottomata: ^ [18:59:26] RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational [18:59:32] yaya am on it, trying to fix some puppet dependencies before i keep moving [18:59:35] so this works better next time [18:59:46] RECOVERY - Druid overlord on druid1004 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server overlord [18:59:47] RECOVERY - Druid middlemanager on druid1004 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server middleManager [18:59:57] RECOVERY - Druid historical on druid1004 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical [19:00:07] RECOVERY - Druid coordinator on druid1004 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server coordinator [19:00:16] RECOVERY - Druid broker on druid1004 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server broker [19:03:30] (03CR) 10Herron: [C: 032] admins: Make daniel a deployer [puppet] - 10https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: 10Reedy) [19:04:06] (03PS1) 10Ottomata: Make cdh::hadoop class depend on apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/372873 (https://phabricator.wikimedia.org/T171626) [19:04:49] (03CR) 10Ottomata: [C: 032] Make cdh::hadoop class depend on apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/372873 (https://phabricator.wikimedia.org/T171626) (owner: 10Ottomata) [19:05:36] PROBLEM - Check systemd 
state on druid1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:05:37] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hadoop-client] [19:07:36] RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational [19:07:46] RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:09:52] (03PS1) 10Urbanecm: Set $wmgUseWikimediaShopLink to true for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) [19:10:16] PROBLEM - DPKG on druid1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:11:16] RECOVERY - DPKG on druid1006 is OK: All packages OK [19:14:57] (03PS4) 10Herron: admins: Make daniel a deployer [puppet] - 10https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: 10Reedy) [19:15:52] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3539353 (10Ottomata) [19:19:16] (03PS1) 10BryanDavis: wmcs: generate /etc/dbusers.yaml with ordered_yaml() [puppet] - 10https://gerrit.wikimedia.org/r/372876 [19:19:38] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team (Watching / External), 10User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3539361 (10herron) 05Open>03Resolved a:03herron Change 371661 has been merged and is propagating out now. Transitionin... 
[19:23:55] robh: thx :) [19:24:34] welcome [19:46:43] (03CR) 10Luke081515: [C: 031] Set $wmgUseWikimediaShopLink to true for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: 10Urbanecm) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T2000). Please do the needful. [20:00:13] no parsoid deploy today [20:14:01] subbu, not sure if I'm posting it during the right window, but can https://gerrit.wikimedia.org/r/372878 be deployed? [20:31:51] robh: do you know what the equivalent of ctrl+\ for exiting the serial console on HP boxes is? [20:31:59] (03PS1) 10Andrew Bogott: bootstrap_vz: chase more setup races [puppet] - 10https://gerrit.wikimedia.org/r/372880 (https://phabricator.wikimedia.org/T165555) [20:32:00] (03PS1) 10Andrew Bogott: nova fullstack: increase timeouts yet again [puppet] - 10https://gerrit.wikimedia.org/r/372881 (https://phabricator.wikimedia.org/T165555) [20:32:44] madhuvishy: esc+( [20:32:58] o/ thank you [20:33:04] so folks keep making new platform docs for the hp's, i need to consolidate them, heh [20:33:13] there are like 4 of them and they are all nearly identical but some have missing info [20:33:42] oh ya [20:33:44] i saw that [20:33:55] missed the get out of serial console line there [20:34:44] Urbanecm, i am a bit busy right now .. can you check with arlo in #mediawiki-parsoid?
[20:34:54] arlolra is his irc handle [20:36:45] (03PS2) 10Andrew Bogott: bootstrap_vz: chase more setup races [puppet] - 10https://gerrit.wikimedia.org/r/372880 (https://phabricator.wikimedia.org/T165555) [20:36:47] (03PS2) 10Andrew Bogott: nova fullstack: increase timeouts yet again [puppet] - 10https://gerrit.wikimedia.org/r/372881 (https://phabricator.wikimedia.org/T165555) [20:37:35] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: chase more setup races [puppet] - 10https://gerrit.wikimedia.org/r/372880 (https://phabricator.wikimedia.org/T165555) (owner: 10Andrew Bogott) [20:37:45] (03CR) 10Andrew Bogott: [C: 032] nova fullstack: increase timeouts yet again [puppet] - 10https://gerrit.wikimedia.org/r/372881 (https://phabricator.wikimedia.org/T165555) (owner: 10Andrew Bogott) [20:42:35] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3539555 (10madhuvishy) @Cmjohnson I tried getting into the management interface for 1007, and powercycled it, booted from network and was looking at... [20:45:38] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3539556 (10madhuvishy) @Cmjohnson I also can't even seem to get into the management interface for labstore1006 ``` ☁ ~ ssh root@labstore1006.mgmt.... [20:53:29] 10Operations, 10Ops-Access-Requests: Requesting access to restricted hosts for dbarratt - https://phabricator.wikimedia.org/T173779#3539582 (10dbarratt) [20:54:44] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: 10Urbanecm) [20:59:08] PROBLEM - configured eth on sodium is CRITICAL: eth1 reporting no carrier. [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come.
Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T2100). [21:03:28] 10Operations, 10Ops-Access-Requests: Requesting access to restricted hosts for dbarratt - https://phabricator.wikimedia.org/T173779#3539626 (10kaldari) David is a developer on the anti-harassment tools team and thus will commonly need access to data that isn't replicated (as many of the tools he will be workin... [21:09:13] !log arlolra@tin Started deploy [parsoid/deploy@2210a38]: Updating Parsoid to 28a9a22b [21:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:41] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3539657 (10Dzahn) a:05Dzahn>03None assigning it from me to pool. it can now be finalized and iridium can be shutdown and wiped. I can't do all the non-interruptable steps myself due to lack of switch access. [21:17:32] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-CentralNotice-Administration (Q3-2017), 10Wikimedia-log-errors: «BannerDataException» when trying to clone a banner - https://phabricator.wikimedia.org/T173782#3539659 (10Base) [21:18:03] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3539671 (10Dzahn) status is: in puppet as role::spare, in DHCP and DNS. 
remnants in mysql grants (https://gerrit.wikimedia.org/r/369832), all else is removed [21:18:50] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-CentralNotice-Administration (Q3-2017), 10Wikimedia-log-errors: «BannerDataException» when trying to clone a banner - https://phabricator.wikimedia.org/T173782#3539672 (10Base) [21:19:18] RECOVERY - configured eth on sodium is OK: OK - interfaces up [21:20:11] !log arlolra@tin Finished deploy [parsoid/deploy@2210a38]: Updating Parsoid to 28a9a22b (duration: 10m 59s) [21:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:08] (03PS6) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [21:26:57] !log bast2001 - running Dell BIOS firmware upgrade (T162850) [21:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:09] T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 [21:30:31] SMalyshev: robh madhuvishy i see you are the ones currently SSHing via bast2001, could i reboot that host quickly to apply BIOS upgrade [21:30:42] mutante: sure np [21:30:49] mutante: sure [21:31:03] ok, cool, there are still the other 3 bastions and it should be back soon :) [21:31:06] yeah lemme just jump out [21:31:10] cool [21:31:11] thanks for the ping :) not using that for now, so feel free to but it [21:31:12] done [21:31:14] *boot [21:31:17] yeah i was mid network switch [21:31:25] so a sudden d/c would have slightly panic'd me [21:31:29] :) ok [21:31:31] with a 'oh fuck what did i just do?' [21:31:31] hehe [21:31:44] 'this is how outages start, shittttttt' ;D [21:31:48] :)) [21:32:01] then again, that is typically my reaction whenever something unexpected happens [21:32:06] 'could i have just caused an outage?'
[21:32:14] now it just needs to do something after i said "Y" [21:32:58] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:01] ah :) [21:33:35] sees on console that it is "task in progress" [21:33:49] "staged update" [21:37:58] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 36.90 ms [21:38:01] bast2001 is back, you can use it again as normal [21:41:07] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2060540 [21:43:16] mobrovac: can i just reboot Cassandra test hosts without announcing it first? [21:43:33] just "test_cluster" [21:43:42] but preferably all 3 [21:44:17] ah, you are probably on a plane right now, heh [21:49:56] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3539734 (10Jdforrester-WMF) [21:54:01] !log cerium - installing Dell BIOS upgrade (T162850) [21:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:14] T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 [21:54:25] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-CentralNotice-Administration, 10Wikimedia-log-errors: «BannerDataException» when trying to clone a banner - https://phabricator.wikimedia.org/T173782#3539753 (10Base) [21:55:43] 10Operations, 10Traffic, 10HTTPS: setup CAA record for policy.wikimedia.org for namecheap (used by WP VIP GO) due by 2017-09-27 - https://phabricator.wikimedia.org/T173787#3539756 (10RobH) [21:57:35] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-CentralNotice-Administration, 10Wikimedia-log-errors: «BannerDataException» when trying to clone a banner - https://phabricator.wikimedia.org/T173782#3539773 (10Base) [21:58:31] !log cerium (cassandra test) - rebooting for firmware upgrade [21:58:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:00] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-CentralNotice-Administration, 10Wikimedia-log-errors: «BannerDataException» when trying to clone a banner - https://phabricator.wikimedia.org/T173782#3539795 (10Base) Worked for the name of [[https://meta.wikimedia.org/wiki/Special:CentralNotice... [22:00:18] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [22:01:54] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539796 (10Dzahn) [22:02:24] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (10Dzahn) [22:03:01] (03PS1) 10RobH: Setting namecheap/comodo CAA records [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) [22:05:18] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [22:06:27] PROBLEM - cassandra-a SSL 10.64.16.153:7001 on cerium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:07:27] RECOVERY - cassandra-a SSL 10.64.16.153:7001 on cerium is OK: SSL OK - Certificate cerium-a valid until 2018-07-13 14:23:32 +0000 (expires in 325 days) [22:07:34] (03PS2) 10RobH: Setting namecheap/comodo CAA records [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) [22:10:11] nice @cassandra.. 
as you should [22:10:20] (03PS7) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [22:11:36] !log xenon - installing BIOS upgrade [22:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:22] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: setup CAA record for policy.wikimedia.org for namecheap (used by WP VIP GO) due by 2017-09-27 - https://phabricator.wikimedia.org/T173787#3539836 (10RobH) I've assumed namecheap uses Comodo, since the namecheap site itself is secured with Comodo. I've... [22:12:46] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for labvirt1019-20 T172538 [dns] - 10https://gerrit.wikimedia.org/r/372390 (owner: 10Cmjohnson) [22:13:27] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3539838 (10Cmjohnson) [22:13:28] PROBLEM - Host xenon is DOWN: PING CRITICAL - Packet loss = 100% [22:17:41] (03CR) 10Platonides: "Not allowing another CA as backup?" [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: 10RobH) [22:19:57] RECOVERY - Host xenon is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:20:24] !log praseodymium - installing BIOS upgrade, reboot [22:20:31] (03CR) 10RobH: ""Not allowing another CA as backup?" 
This is an issuance by a third party, the only backup we would possibly allow is LetsEncrypt, since " [dns] - https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: RobH)
[22:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:57] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: connect to address 10.64.0.202 and port 9042: Connection refused
[22:22:07] PROBLEM - cassandra-a SSL 10.64.0.202:7001 on xenon is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[22:22:47] that should recover in a sec, just like cerium
[22:22:57] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.000 second response time on 10.64.0.202 port 9042
[22:23:07] RECOVERY - cassandra-a SSL 10.64.0.202:7001 on xenon is OK: SSL OK - Certificate xenon-a valid until 2018-07-13 14:23:28 +0000 (expires in 325 days)
[22:23:28] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100%
[22:28:38] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[22:30:07] PROBLEM - cassandra-a SSL 10.64.16.188:7001 on praseodymium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[22:30:45] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539870 (Dzahn)
[22:31:07] RECOVERY - cassandra-a SSL 10.64.16.188:7001 on praseodymium is OK: SSL OK - Certificate praseodymium-a valid until 2018-07-13 14:23:35 +0000 (expires in 325 days)
[22:31:08] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 22
[22:31:24] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn) done now: cerium xenon, praseodymium (cassandra test cluster) removed machines from list that are already gone: subra, suhail, mira
[22:33:12] Operations, Patch-For-Review:
CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539876 (Dzahn)
[22:35:54] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539877 (Dzahn)
[22:36:20] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn)
[22:37:00] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn)
[22:39:29] !next
[22:39:37] or how do i ask for next deploy now?
[22:39:41] jouncebot: next
[22:39:41] In 0 hour(s) and 20 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T2300)
[22:39:42] like that
[22:39:45] heh
[22:40:05] no patches listed
[22:40:21] ah :)
[22:40:29] oh, since releng is offsite
[22:40:30] right
[22:40:45] so that would be good time to upgrade deployment server i suppose
[22:40:50] i think so too!
[22:45:08] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 29 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:46:31] no, releng is here :)
[22:46:36] services is at an offsite
[22:46:42] mutante: ^
[22:47:47] we've just been quiet today due to the eclipse
[22:48:24] * greg-g had to drive to Hopland due to the fog
[22:50:07] !log rebooting tin for firmware update since its idle
[22:50:08] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:27] Operations, MediaWiki-Platform-Team, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3539894 (MaxSem) In principle, nothing prevents us from switching deployment-prep right now by editing https://wikite...
[22:50:36] greg-g: oh, heh :) !ok
[22:51:50] mutante: robh and there is a patch waiting for SWAT :)
[22:52:28] heh! it should be back any moment :)
[22:53:29] ahhh welllllll fck..
[22:53:49] fck or fsck? :)
[22:53:51] i think we'll be in time :)
[22:54:35] heh
[22:54:40] so yeah, it just finished the reboot
[22:54:45] sorry, finished update
[22:54:46] its rebooting now
[22:55:56] then it will say something about the staged upgrade.. and then come back normally
[22:56:15] yep
[22:56:19] it did the staged upgrade thing already
[22:56:21] its back
[22:56:24] \o/
[22:56:27] nice
[22:56:42] the staged upgrade was not very good about updating serial output
[22:56:55] it showed like 2 updates from 0 to over half done
[22:56:57] to reboot
[22:57:03] ah yea, i noticed that too.
good output only if am connected BEFORE the reboot
[22:57:06] or so
[22:57:50] robh: you might have to do "keyholder arm" now
[22:57:52] on tin
[22:58:00] and enter the passphrase that is saved in pwstore
[22:58:15] will do
[22:58:16] but at least it's just _one_ passphrase now and not 10, since i made them the same
[22:58:19] thx
[22:58:25] ive not done that so nice to know it
[22:58:48] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[22:58:54] it should be called "deployment-key-passphrase" or so in pwstore
[22:58:57] yea, that ^
[22:59:12] so sudo -i keyholder arm ?
[22:59:34] and done
[22:59:57] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys.
[23:00:03] :) ok
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170821T2300). Please do the needful.
[23:00:06] Niharika: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:10] just in time!
[23:00:12] perfect timing. 1 second
[23:00:12] lol
[23:00:13] heh
[23:00:20] o/ I'm available but can't SWAT.
[23:01:12] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539904 (RobH)
[23:04:33] /12/12
[23:06:49] Operations, Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3539905 (Dzahn)
[23:17:12] :(
[23:29:31] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3539928 (Dzahn) Alex said on Gerrit: "This patch makes it possible for a host to not be in our icinga installation configured.
Which is not what T151632 originally asked for...
[23:39:23] (CR) Dzahn: base::monitoring: make it possible to disable monitoring (2 comments) [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)
[23:39:25] (PS9) Dzahn: base::monitoring: make it possible to disable monitoring [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632)
[23:39:51] (CR) Dzahn: "addressed the technical comments and put the meta comment on the ticket" [puppet] - https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)