[00:00:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:19:41] (03PS1) 10Brion VIBBER: Switch in WebM VP9/Opus video transcodes to replace WebM VP8/Vorbis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447572 (https://phabricator.wikimedia.org/T63805) [00:24:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:23:24] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 09m 44s) [02:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:04] robh: Not sure who best to ask, but I’m still awaiting review of these webperf patches, would like feedback and/or to land soon so I can continue work on moving arclamp from mwlog1001 [02:25:10] https://gerrit.wikimedia.org/r/#/q/status:open+hashtag:beta-picked+project:operations/puppet+branch:production+topic:webperf [02:26:32] At this point they should all be no-ops for prod, mostly refactoring to prepare for the next step. 
[02:43:18] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54478 MB (3% inode=99%) [02:52:46] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.13) (duration: 09m 19s) [02:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:38] RECOVERY - Disk space on maps1001 is OK: DISK OK [03:03:08] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 24 03:03:08 UTC 2018 (duration 10m 22s) [03:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 783.45 seconds [03:50:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 177.34 seconds [04:44:49] !log Deploy schema change on db1066 (s2 primary master) T144010 T51190 T199368 [04:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:56] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [04:44:56] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [04:44:59] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [04:49:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081, db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447574 (https://phabricator.wikimedia.org/T200061) [04:51:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081, db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447574 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [04:52:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081, db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447574 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [04:53:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081, db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447574 
(https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [04:54:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084, db1121 (duration: 00m 56s) [04:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:12] !log Stop replication in sync on db1081 and db1121 [04:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:55] !log Deploy schema change on db1081 T144010 T51190 T199368 [05:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:01] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:06:01] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:06:01] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:13:15] (03PS1) 10Marostegui: db-eqiad.php: Repool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447577 [05:14:57] (03PS2) 10Marostegui: db-eqiad.php: Repool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447577 [05:19:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447577 (owner: 10Marostegui) [05:20:52] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447577 (owner: 10Marostegui) [05:21:05] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447577 (owner: 10Marostegui) [05:25:25] marostegui: OK to deploy cxserver change? Let me know. [05:26:15] kart_: give me a minute :) [05:26:24] Sure [05:26:26] got distracted and didn't deploy my merged change above [05:27:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 54s) [05:27:31] kart_: all yours! [05:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:38] marostegui: cool. 
[05:29:52] !log kartik@deploy1001 Started deploy [cxserver/deploy@d378d27]: Update cxserver to d3c9d15 (T198941) [05:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:56] T198941: SyntaxError: Unexpected token u in JSON at position 0 - https://phabricator.wikimedia.org/T198941 [05:33:53] !log kartik@deploy1001 Finished deploy [cxserver/deploy@d378d27]: Update cxserver to d3c9d15 (T198941) (duration: 04m 01s) [05:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:26] marostegui: done. [05:34:38] thanks [06:32:07] !log Deploy schema change on dbstore1002:s4 T144010 T51190 T199368 [06:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:13] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [06:32:13] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [06:32:14] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [06:33:56] (03PS1) 10Marostegui: db-eqiad.php: Repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447582 [06:35:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447582 (owner: 10Marostegui) [06:36:45] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447582 (owner: 10Marostegui) [06:38:17] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447582 (owner: 10Marostegui) [06:38:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 (duration: 00m 55s) [06:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:09] PROBLEM - Memory correctable errors -EDAC- on cp1049 is CRITICAL: 43 ge 4 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1049&var-datasource=eqiad%2520prometheus%252Fops [06:47:06] (03PS1) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad (instead of es1014) [puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) [06:49:20] (03PS1) 10Jcrespo: Correct es2 and es3 masters on prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/447585 [06:50:17] (03PS2) 10Jcrespo: mariadb: Correct es2 and es3 masters on prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/447585 [06:51:13] (03CR) 10Jcrespo: [C: 032] mariadb: Correct es2 and es3 masters on prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/447585 (owner: 10Jcrespo) [06:53:35] (03PS2) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad (instead of es1014) [puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) [06:54:36] (03CR) 10Jcrespo: [C: 04-1] "Blocking until deployment time." 
[puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [06:56:51] (03PS1) 10Jcrespo: mariadb: Promote es1017 as the master of es3-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) [06:59:35] (03CR) 10Marostegui: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/11835/" [puppet] - 10https://gerrit.wikimedia.org/r/447584 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [07:06:23] (03PS2) 10Elukey: profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) [07:07:23] elukey: dbstore1002 is misbehaving due to a schema change that is probably overloading it [07:07:26] I am on it [07:07:49] thanks :( [07:08:04] Big schema changes there are a pain :( [07:12:18] (03PS1) 10Jcrespo: Setup es1017 as the backend for the es3-eqiad master [dns] - 10https://gerrit.wikimedia.org/r/447587 (https://phabricator.wikimedia.org/T197073) [07:13:03] (03CR) 10Marostegui: "The change looks good to me, the commit message looks a bit strange to me though" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447586 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [07:13:35] elukey: Not much we can do now just let the alter finish [07:14:01] (03CR) 10Marostegui: [C: 031] Setup es1017 as the backend for the es3-eqiad master [dns] - 10https://gerrit.wikimedia.org/r/447587 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [07:16:28] (03CR) 10Volans: "I left some minor comments in the Python file" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [07:23:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11836/" [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) (owner: 10Elukey) [07:42:40] 
(03CR) 10Volans: "Forgot to add one comment, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [08:20:54] (03CR) 10Elukey: [C: 032] profile::kafka::broker: raise default max open files to 128k [puppet] - 10https://gerrit.wikimedia.org/r/447389 (https://phabricator.wikimedia.org/T200177) (owner: 10Elukey) [08:21:37] !log rolling restart of kafka jumbo/main-(eqiad|codfw) clusters to pick up the new max open files limit (infinity -> 128k) [08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:05] !log restart varnish-fe on cache_text instances with cold, labeled VCL T200207 [08:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:09] T200207: Discard of cold labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 [08:47:18] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10fgiunchedi) Thanks for the update @Cmjohnson, not particularly urgent but it would be nice to have graphite1004 before the end of the quarter [08:58:37] 10Operations, 10Traffic: Discard of cold, labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (10ema) [09:13:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447592 [09:15:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447592 (owner: 10Marostegui) [09:16:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447592 (owner: 10Marostegui) [09:17:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 (duration: 00m 54s) [09:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:56] !log 
Deploy schema change on db1097:3314 T144010 T51190 T199368 [09:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:01] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [09:18:02] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [09:18:02] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [09:18:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447592 (owner: 10Marostegui) [09:19:10] (03PS1) 10Ema: Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) [09:19:25] (03PS2) 10Ema: Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) [09:20:07] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:22:55] (03PS3) 10Ema: Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) [09:30:19] (03PS2) 10DCausse: Upgrade to 6.3.1-alpha1 (without hebrew) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) [09:37:25] (03PS1) 10Jcrespo: switchover: Make posible replica migration an optional, separate step [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447595 (https://phabricator.wikimedia.org/T199224) [09:37:49] (03CR) 10jerkins-bot: [V: 04-1] switchover: Make posible replica migration an optional, separate step [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447595 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) 
[09:47:20] (03PS2) 10Jcrespo: switchover: Make posible replica migration an optional, separate step [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447595 (https://phabricator.wikimedia.org/T199224) [09:59:11] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Set a proper max open files limit for Kafka clusters - https://phabricator.wikimedia.org/T200177 (10mobrovac) [10:09:55] (03CR) 10Jcrespo: [C: 032] switchover: Make posible replica migration an optional, separate step [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447595 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [10:17:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447596 [10:23:01] (03PS1) 10Filippo Giunchedi: mtail: gather metrics on systemd respawns [puppet] - 10https://gerrit.wikimedia.org/r/447597 (https://phabricator.wikimedia.org/T147923) [10:30:19] (03PS2) 10Filippo Giunchedi: mtail: gather metrics on systemd respawns [puppet] - 10https://gerrit.wikimedia.org/r/447597 (https://phabricator.wikimedia.org/T147923) [10:30:25] (03CR) 10Filippo Giunchedi: [C: 032] mtail: gather metrics on systemd respawns [puppet] - 10https://gerrit.wikimedia.org/r/447597 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [10:32:04] 10Operations, 10Analytics, 10EventBus, 10Services (watching): Set a proper max open files limit for Kafka clusters - https://phabricator.wikimedia.org/T200177 (10elukey) 05Open>03Resolved [10:38:24] (03PS1) 10Filippo Giunchedi: syslog: add systemd.mtail [puppet] - 10https://gerrit.wikimedia.org/r/447598 (https://phabricator.wikimedia.org/T147923) [10:38:34] (03CR) 10Filippo Giunchedi: [C: 032] syslog: add systemd.mtail [puppet] - 10https://gerrit.wikimedia.org/r/447598 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [10:39:38] (03PS2) 10Filippo Giunchedi: syslog: add systemd.mtail [puppet] - 
10https://gerrit.wikimedia.org/r/447598 (https://phabricator.wikimedia.org/T147923) [10:40:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447599 [10:42:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447599 (owner: 10Marostegui) [10:43:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447599 (owner: 10Marostegui) [10:44:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 55s) [10:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:59] !log Stop replication in sync on db1084 and db1097:3314 [10:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:10] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447599 (owner: 10Marostegui) [10:48:48] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447601 [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:01:01] (03PS1) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [11:01:21] (03CR) 10jerkins-bot: [V: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:02:34] (03PS2) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [11:02:54] (03CR) 10jerkins-bot: [V: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:04:34] (03PS1) 10Marostegui: db-eqiad.php: Repool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447604 [11:06:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447604 (owner: 10Marostegui) [11:07:35] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447604 (owner: 10Marostegui) [11:07:58] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447604 (owner: 10Marostegui) [11:08:25] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447596 (owner: 10Marostegui) [11:08:35] (03PS2) 10Marostegui: db-eqiad.php: Repool db1084, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447601 [11:08:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 (duration: 01m 06s) [11:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:22] (03CR) 10Marostegui: [C: 032] 
db-eqiad.php: Repool db1084, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447601 (owner: 10Marostegui) [11:11:31] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447601 (owner: 10Marostegui) [11:12:10] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084, db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447601 (owner: 10Marostegui) [11:12:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 55s) [11:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:02] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s2 populateChangeTagDef.php --sleep 2 (T193873) [11:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:06] T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) - https://phabricator.wikimedia.org/T193873 [11:19:47] https://phabricator.wikimedia.org/T200121 [11:20:10] this is a serious issue, some files are lost :(( [11:21:40] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s1 populateChangeTagDef.php (T193873) [11:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:58] (03PS3) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [11:31:21] (03CR) 10jerkins-bot: [V: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:34:28] (03PS1) 10Zfilipin: Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447607 [11:35:26] !log disable puppet on cp-text hosts to merge alternate domains patch T164609 [11:35:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:30] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [11:36:29] (03CR) 10Ema: [C: 032] Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [11:36:37] (03PS4) 10Ema: Revert "Revert "cache_text: add support for alternate_domains"" [puppet] - 10https://gerrit.wikimedia.org/r/447593 (https://phabricator.wikimedia.org/T164609) [11:37:25] (03PS4) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [11:37:47] (03CR) 10jerkins-bot: [V: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:41:33] (03PS5) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [11:42:00] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.14 and rebuild l10n cache [11:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] (03CR) 10Jcrespo: "This is still untested, but it will handle automatically (but without puppet) the stop and kill of heartbeat." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:43:19] (03CR) 10Jcrespo: [C: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [11:45:42] !log zfilipin@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_2212739269" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 03m 42s) [11:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:33] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.14 and rebuild l10n cache [11:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:23] !log zfilipin@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_4179557944" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 02m 50s) [11:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1200) [12:00:31] (03PS6) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [12:03:15] (03PS7) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [12:06:40] kart_: re T199941, is it resolved? or just no longer blocking the train? 
[12:06:41] T199941: Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string" - https://phabricator.wikimedia.org/T199941 [12:07:53] !log depool cp1067 to test alternate domains patch T164609 [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [12:11:49] !log vacuum full of postgres on maps1001 to try to reclaim space - T200228 [12:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:17] T200228: disk space alert on maps1001 - https://phabricator.wikimedia.org/T200228 [12:22:21] zeljkof: I tried to reproduce it as originally described in the ticket, e.g. it is not happening at: https://test.wikipedia.org/wiki/User:KartikMistry [12:26:12] kart_: can you confirm it's resolved? or should somebody else confirm it? Krinkle? [12:27:44] zeljkof: Krinkle is the better person to confirm. [12:28:49] kart_: thanks, this is the only thing blocking last week's train and I am trying to get it resolved as quickly as possible, we should start with .14 today, and .13 is still blocked :/ [12:52:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447611 [12:53:32] (03CR) 10BBlack: Serve WebP variants for the hottest thumbnails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434055 (https://phabricator.wikimedia.org/T27611) (owner: 10Gilles) [12:53:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447611 (owner: 10Marostegui) [12:55:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447611 (owner: 10Marostegui) [12:56:14] zeljkof: I was hoping to deploy right before the train, but there is an untracked file in mediawiki-staging on deploy1001 [12:56:18] modified: wikiversions.json [12:58:11] 
marostegui: fixed, train stuff [12:58:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447611 (owner: 10Marostegui) [12:58:27] ok thanks! deploying! [12:59:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 (duration: 00m 55s) [12:59:37] all done - thanks zeljkof [12:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] hashar: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1300). [13:01:48] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1330467944 [13:02:08] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16783224 [13:02:28] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 789072856 [13:02:44] gehel: ^^^ [13:03:19] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 865520 [13:03:38] RECOVERY - Postgres Replication Lag on maps1004 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 587272 [13:03:59] yep, that's clearly me! [13:04:03] volans: thanks! 
[13:05:09] yw :) [13:05:18] hopefully nothing major [13:05:54] volans: nope, vacuuming the tables to see if there is recoverable space (and it looks like there is) [13:06:13] ack, nice [13:06:19] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2256 [13:06:34] I should have downtimed that alert (now done) [13:07:40] !log repool cp1067 with alternate domains support T164609 [13:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:43] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [13:15:57] (03PS10) 10Ema: cache_text: load misc VCL as wikimedia_misc in VTC files [puppet] - 10https://gerrit.wikimedia.org/r/443930 (https://phabricator.wikimedia.org/T164609) [13:16:42] (03CR) 10Ema: [C: 032] cache_text: load misc VCL as wikimedia_misc in VTC files [puppet] - 10https://gerrit.wikimedia.org/r/443930 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [13:17:01] (03PS8) 10Ema: cache_text: add misc-specific VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/443974 (https://phabricator.wikimedia.org/T164609) [13:17:38] (03CR) 10Ema: [C: 032] cache_text: add misc-specific VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/443974 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [13:18:38] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:19:23] marostegui: sorry, just saw this, I was on lunch, train is blocked :/ [13:28:50] Ah :| [13:28:55] Still? 
[13:29:06] Then I will deploy something else I think :) [13:31:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 [13:32:27] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 (owner: 10Marostegui) [13:33:32] (03PS2) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 [13:34:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 (owner: 10Marostegui) [13:35:52] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447617 [13:35:54] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447617 (owner: 10Zfilipin) [13:35:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 (owner: 10Marostegui) [13:37:10] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447617 (owner: 10Zfilipin) [13:38:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 01m 59s) [13:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:13] !log Stop replication in sync db1081 and db1103:3314 [13:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447615 (owner: 10Marostegui) [13:40:34] !log Deploy schema change on db1081 T144010 T51190 T199368 [13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:40] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [13:40:41] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 
[13:40:41] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [13:42:28] !log Stop replication in sync db1084 and db1103:3314 [13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:34] !log Deploy schema change on db1084 T144010 T51190 T199368 [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:51] !log apply alternate domains patch to text-eqiad T164609 [13:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:55] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [13:53:22] (03CR) 10Marostegui: switchover: Add the functionality to start and stop heartbeat (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [13:58:52] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447620 (https://phabricator.wikimedia.org/T191059) [13:59:09] PROBLEM - Varnish frontend child restarted on cp1068 is CRITICAL: 4 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1068&var-datasource=eqiad+prometheus/ops [14:00:20] known ^ [14:03:23] Krinkle: I have almost deployed .13 everywhere, but I see you have added T200269 to blockers of T191059 [14:03:24] T191059: 1.32.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T191059 [14:03:24] T200269: Unable to undelete revision (Fatal error: given Title does not belong to page ID, RevisionStoreRecord) - https://phabricator.wikimedia.org/T200269 [14:03:42] well, this train is going well :/ [14:03:55] zeljkof: Sorry :/ [14:04:28] Either these are really difficult problem, or we're understaffed, or it seems people aren't working on the UBNs. 
[14:04:35] Krinkle: thanks for the report :) I just want this train to finish, somehow [14:04:39] Yeah [14:05:09] or somehow we don't test important things before train [14:09:29] !log Deploy schema change on db1103:3314 T144010 T51190 T199368 [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:35] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [14:09:35] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [14:09:35] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [14:14:59] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1069:9100 job=node site=eqiad Jcrespo https://phabricator.wikimedia.org/T199056 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [14:19:54] (03Abandoned) 10Zfilipin: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447620 (https://phabricator.wikimedia.org/T191059) (owner: 10Zfilipin) [14:21:31] (03PS1) 10Zhuyifei1999: Add libmysqlclient-dev to python 3 base docker image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/447622 (https://phabricator.wikimedia.org/T190274) [14:22:03] (03PS1) 10Zfilipin: Revert "all wikis to 1.32.0-wmf.13" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) [14:22:28] (03PS1) 10Zhuyifei1999: Add .gitreview [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/447625 [14:23:55] (03PS2) 10Zfilipin: Group 2 back to php-1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) [14:24:21] (03CR) 10Zfilipin: [C: 032] Group 2 back to php-1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) (owner: 10Zfilipin) 
[14:24:29] (03CR) 10Thcipriani: [C: 031] Group 2 back to php-1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) (owner: 10Zfilipin) [14:25:35] (03Merged) 10jenkins-bot: Group 2 back to php-1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447623 (https://phabricator.wikimedia.org/T191059) (owner: 10Zfilipin) [14:34:38] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:42:59] looking ^ [14:47:51] !log T156137: banning elastic1031 due to high load (same "getEntryAfterMiss" symptoms) [14:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:55] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [14:48:57] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447626 [14:50:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447626 (owner: 10Marostegui) [14:51:44] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447626 (owner: 10Marostegui) [14:53:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 01m 02s) [14:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:32] (03PS1) 10Gehel: maps: disable OSM updates on eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 [14:57:06] (03CR) 10jerkins-bot: [V: 04-1] maps: disable OSM updates on eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 (owner: 10Gehel) [14:58:07] (03PS2) 10Gehel: maps: disable OSM updates on 
eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 (https://phabricator.wikimedia.org/T200228) [15:08:09] (03PS6) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:09:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [15:10:00] (03PS1) 10Andrew Bogott: Restrict cloud dns recursors to $LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/447632 [15:10:34] (03PS8) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [15:10:48] (03CR) 10jerkins-bot: [V: 04-1] Restrict cloud dns recursors to $LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/447632 (owner: 10Andrew Bogott) [15:11:04] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [15:11:13] (03CR) 10jerkins-bot: [V: 04-1] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [15:12:30] (03PS2) 10Andrew Bogott: Restrict cloud dns recursors to $LABS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/447632 [15:20:16] (03PS9) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [15:27:36] (03CR) 10Andrew Bogott: [C: 032] Restrict cloud dns recursors to $LABS_NETWORKS [puppet] 
- 10https://gerrit.wikimedia.org/r/447632 (owner: 10Andrew Bogott) [15:29:20] !log restart postgres on maps1001 - T200228 [15:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:24] T200228: disk space alert on maps1001 - https://phabricator.wikimedia.org/T200228 [15:33:42] (03CR) 10Mholloway: [C: 031] maps: disable OSM updates on eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 (https://phabricator.wikimedia.org/T200228) (owner: 10Gehel) [15:34:29] RECOVERY - Recursive DNS on 208.80.153.78 is OK: DNS OK: 0.240 seconds response time. www.wikipedia.org returns 208.80.154.224 [15:37:22] (03PS1) 10Andrew Bogott: labservices: typo fix in heira [puppet] - 10https://gerrit.wikimedia.org/r/447634 [15:37:59] (03PS10) 10Jcrespo: switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) [15:38:13] (03CR) 10Andrew Bogott: [C: 032] labservices: typo fix in heira [puppet] - 10https://gerrit.wikimedia.org/r/447634 (owner: 10Andrew Bogott) [15:38:55] !log stopping puppet on es2017, es2018; changing mysql configuration for production testing [15:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:27] (03PS1) 10Marostegui: db-eqiad.php: Repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447635 [15:42:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447635 (owner: 10Marostegui) [15:44:50] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447635 (owner: 10Marostegui) [15:45:39] es2018 and es2019 will alert of replica lag [15:45:43] this is expected [15:46:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 (duration: 01m 02s) [15:46:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:13] it is part of the test I am doing (disconnected both datacenters to make sure they do not affect the primary dc) [15:48:28] (03PS7) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:48:51] (03CR) 10Jcrespo: [C: 032] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [15:49:05] (03CR) 10Jcrespo: [C: 032] switchover: Add the functionality to start and stop heartbeat [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447603 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [15:49:12] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [15:55:48] !log test switchover from es2017 to es2018 [15:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:38] PROBLEM - MariaDB Slave IO: es3 on es2017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1275, Errmsg: error connecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Server is running in --secure-auth mode, but repl@10.192.0.142 has a password in the old format: please change the password to the new format [15:59:17] !log T156137: restarting elasticsearch on elastic1031 to disable G1GC [15:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [15:59:29] We are fixing that alert [15:59:46] It was part of a test [16:00:05] godog, moritzm, and _joe_: Your horoscope
predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:18] (03PS8) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [16:01:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [16:01:58] RECOVERY - MariaDB Slave IO: es3 on es2017 is OK: OK slave_io_state Slave_IO_Running: Yes [16:02:46] 10Operations, 10netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10ayounsi) p:05Triage>03Normal [16:02:47] !log T156137: unbanning elastic1031 [16:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:49] PROBLEM - MariaDB Slave Lag: es3 on es2017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.56 seconds [16:07:07] ^ that is part of a test [16:07:42] !log test switchover from es2018 to es2017 [16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:59] RECOVERY - MariaDB Slave Lag: es3 on es2017 is OK: OK slave_sql_lag not a slave [16:08:01] SUCCESS: Master switch completed successfully [16:08:25] it took a bit more, but I am doing cross-dc commands (around 4-5 seconds) [16:20:14] (03PS2) 10Bstorm: gridengine: try to translate all the Ubuntu package calls to Debian [puppet] - 10https://gerrit.wikimedia.org/r/447561 (https://phabricator.wikimedia.org/T199276) [16:21:32] (03PS1) 10Anomie: Set MCR write-both-read-old on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447638 (https://phabricator.wikimedia.org/T197817) [16:21:49] (03PS1) 10Anomie: Set MCR read-old-write-both on Beta Cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447639 (https://phabricator.wikimedia.org/T198311) [16:21:51] (03PS1) 10Anomie: Set MCR read-new-write-both on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447640 (https://phabricator.wikimedia.org/T198311) [16:28:39] PROBLEM - MariaDB Slave Lag: es3 on es2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2776.41 seconds [16:29:09] anomie: it's taking so long to merge :/ [16:29:26] ETA 6 minutes [16:29:42] (03PS1) 10Ema: cache_canary: add phabricator for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/447643 (https://phabricator.wikimedia.org/T164609) [16:34:02] (03PS2) 10Ema: cache_canary: add phabricator for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/447643 (https://phabricator.wikimedia.org/T164609) [16:37:00] (03CR) 10Ema: [C: 032] cache_canary: add phabricator for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/447643 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [16:38:05] Amir1: it's merged, right? [16:38:15] merged right now [16:38:21] pulling it in mwdebug1002 [16:39:57] anomie: It's live in mwdebug1002, can you test it there? [16:40:13] Amir1: Worked. 
[16:41:49] anomie: Thanks, syncing [16:42:45] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.13/includes/page/PageArchive.php: [[gerrit:447636|PageArchive: Pass correct overrides to newRevisionFromArchiveRow() (T200072)]] (duration: 01m 03s) [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:49] T200072: MCR causes Cognate integration test fails: The given Title does not belong to page ID 2 but actually belongs to 4 - https://phabricator.wikimedia.org/T200072 [16:43:22] zeljkof: It's deployed now [16:43:24] (03PS2) 10EBernhardson: Split elasticsearch::log::hot_threads into two pieces [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) [16:43:26] (03CR) 10EBernhardson: Split elasticsearch::log::hot_threads into two pieces (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [16:43:38] Amir1: thanks! [16:43:40] merging and deploying the .14 atm [16:44:21] (03PS1) 10Ema: cache_text: disable all alternate_domains but grafana [puppet] - 10https://gerrit.wikimedia.org/r/447646 (https://phabricator.wikimedia.org/T164609) [16:45:40] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Host is back in service [16:49:55] (03PS1) 10Jcrespo: mariadb: Fix heartbeat regex [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447648 (https://phabricator.wikimedia.org/T199224) [16:54:31] (03PS1) 10Dzahn: planet: fix missing language in link element [puppet] - 10https://gerrit.wikimedia.org/r/447649 (https://phabricator.wikimedia.org/T198680) [16:54:35] (03PS2) 10Ema: cache_text: disable all alternate domains but config-master [puppet] - 10https://gerrit.wikimedia.org/r/447646 (https://phabricator.wikimedia.org/T164609) [16:55:53] (03PS1) 10Zfilipin: all 
wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 [16:55:55] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [16:56:33] (03Abandoned) 10Zfilipin: Group0 to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447607 (owner: 10Zfilipin) [16:56:56] (03CR) 10Zfilipin: [C: 04-2] all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [16:57:10] (03CR) 10Zfilipin: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [16:58:00] !log finishing test on es3 hosts T199224 [16:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:04] T199224: Test database master switchover script on codfw - https://phabricator.wikimedia.org/T199224 [16:58:18] (03PS3) 10Ema: cache_text: disable all alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/447646 (https://phabricator.wikimedia.org/T164609) [16:58:19] RECOVERY - MariaDB Slave Lag: es3 on es2018 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [16:58:20] alerts will end as soon as replicas catch up soon [16:59:45] (03CR) 10Jcrespo: [C: 032] mariadb: Fix heartbeat regex [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447648 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1700). [17:00:49] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54522 MB (3% inode=99%) [17:00:59] (03PS2) 10Zfilipin: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 [17:01:15] gehel: ^ [17:02:56] jynus: thanks! 
[17:03:29] is that normal? [17:03:42] not normal, but something you were aware? [17:04:09] jynus: not normal, but I'm aware and working on it [17:04:18] I'm silencing it for now [17:05:58] (03CR) 10Thcipriani: [C: 031] all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [17:06:55] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.14/includes/page/PageArchive.php: [[gerrit:447636|PageArchive: Pass correct overrides to newRevisionFromArchiveRow() (T200072)]] (duration: 01m 01s) [17:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:58] T200072: MCR causes Cognate integration test fails: The given Title does not belong to page ID 2 but actually belongs to 4 - https://phabricator.wikimedia.org/T200072 [17:07:11] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [17:08:27] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447650 (owner: 10Zfilipin) [17:13:13] (03CR) 10Ema: [C: 032] cache_text: disable all alternate domains [puppet] - 10https://gerrit.wikimedia.org/r/447646 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [17:14:02] Amir1: the fix you link in your sync is for wmf.13, but you sync'd wmf.14, is that correct? 
[17:14:30] thcipriani: I just sync'd the .14 fix it's in SAL [17:14:47] I meant gerrit 447636 [17:14:58] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/447636/ [17:15:05] seems to be for wmf.13 [17:15:26] ah, yeah, I forgot to change the deploy summary, sorry [17:15:35] ah, ok [17:15:43] wmf.14 isn't deployed anywhere yet [17:16:01] so we're wrangling it on the deployment servers now [17:16:01] just to be sure :D [17:16:12] I rebased it as well [17:16:22] as long as that change is merged in wmf.14 my current plan should be ok :) [17:17:40] !log restart eventstreams on scb2* nodes (hopefully last time before deploying the fix) to avoid mem leaks issues during the EU night [17:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:18] !log restart varnish-fe on cp1068 to clear "child restarted" alert T164609 [17:18:18] !log train window running long, services deploy delayed [17:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:22] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [17:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:59] RECOVERY - Varnish frontend child restarted on cp1068 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1068&var-datasource=eqiad+prometheus/ops [17:19:40] cscott, arlolra, subbu,halfak, Amir1: please stand by for services deploy, we are probably going to move wmf.13 to all wikis right now [17:27:52] (03PS2) 10Dzahn: planet: fix missing language in link element [puppet] - 10https://gerrit.wikimedia.org/r/447649 (https://phabricator.wikimedia.org/T198680) [17:28:15] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11844/planet1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/447649 (https://phabricator.wikimedia.org/T198680) (owner: 10Dzahn) [17:33:35] !log 
zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.13 [17:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:21] load time of Special:Tags in frwiki is now down from 3s to 0.4s [17:38:28] RECOVERY - Disk space on maps1001 is OK: DISK OK [17:40:19] !log re-enable puppet on all cache nodes with alternate domains disabled T164609 [17:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:23] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 [17:42:56] (03PS3) 10Gehel: maps: disable OSM updates on eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 (https://phabricator.wikimedia.org/T200228) [17:43:37] (03CR) 10Gehel: [C: 032] maps: disable OSM updates on eqiad while vacuum is running [puppet] - 10https://gerrit.wikimedia.org/r/447627 (https://phabricator.wikimedia.org/T200228) (owner: 10Gehel) [17:44:25] (03PS4) 10Gehel: Enable fetching constraints for Updater [puppet] - 10https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:45:00] (03PS1) 10Bstorm: wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) [17:46:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Dzahn) a:03Dzahn [17:47:09] (03CR) 10Gehel: [C: 032] Enable fetching constraints for Updater [puppet] - 10https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:57:00] (03Abandoned) 10Dzahn: add IPv6 for bast3003 [dns] - 10https://gerrit.wikimedia.org/r/405225 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [17:57:08] (03Abandoned) 10Dzahn: bast3002->bast3003 in DHCP,network constants,smokeping [puppet] - 
10https://gerrit.wikimedia.org/r/405229 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [17:57:13] (03Abandoned) 10Dzahn: bast3002->bast3003 as prometheus node, rm from site [puppet] - 10https://gerrit.wikimedia.org/r/405230 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [17:57:19] (03Abandoned) 10Dzahn: prometheus.svc.esams.wmnet: bast3002->bast3003 [dns] - 10https://gerrit.wikimedia.org/r/405231 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [17:57:25] (03Abandoned) 10Dzahn: decom bast3002, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/405232 (https://phabricator.wikimedia.org/T184936) (owner: 10Dzahn) [17:59:12] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035 (10Dzahn) [17:59:27] 10Operations, 10ops-esams, 10Patch-For-Review: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936 (10Dzahn) 05stalled>03Invalid Thanks Mark :) i think we can close this as Invalid then. [18:01:02] (03PS1) 10Jcrespo: switchover: Commit pending transactions when setting read only [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447655 [18:01:28] (03CR) 10Jcrespo: [C: 032] switchover: Commit pending transactions when setting read only [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447655 (owner: 10Jcrespo) [18:02:42] 10Operations, 10ops-eqiad: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) You can assign this to me after the initial setup to implement service. 
[18:06:11] (03PS2) 10Dzahn: phabricator weekly project changes email: Ignore disabled new assignees [puppet] - 10https://gerrit.wikimedia.org/r/443401 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [18:07:22] (03PS1) 10Andrew Bogott: wmcs region-migrate: add a wait and reboot after the copy [puppet] - 10https://gerrit.wikimedia.org/r/447656 [18:08:09] (03CR) 10Andrew Bogott: [C: 032] wmcs region-migrate: add a wait and reboot after the copy [puppet] - 10https://gerrit.wikimedia.org/r/447656 (owner: 10Andrew Bogott) [18:10:57] (03CR) 10Dzahn: [C: 032] phabricator weekly project changes email: Ignore disabled new assignees [puppet] - 10https://gerrit.wikimedia.org/r/443401 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [18:11:17] (03PS3) 10Dzahn: phabricator weekly project changes email: Ignore disabled new assignees [puppet] - 10https://gerrit.wikimedia.org/r/443401 (https://phabricator.wikimedia.org/T195780) (owner: 10Aklapper) [18:15:29] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Jenkins, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10Dzahn) [18:15:33] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Jenkins, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10Dzahn) @Hashar Can we change Jenkins config to use the new host per Krinkle's question above? 
[18:19:45] (03PS1) 10Jcrespo: switchover: Fix bug where shards are added with an extra 'b' [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447659 (https://phabricator.wikimedia.org/T199224) [18:20:09] (03CR) 10Jcrespo: [C: 032] switchover: Fix bug where shards are added with an extra 'b' [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/447659 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [18:27:27] (03PS2) 10Dzahn: Remove priyankaivy.blogspot.com from Planet [puppet] - 10https://gerrit.wikimedia.org/r/444328 (owner: 10Amire80) [18:32:00] !log mobrovac@deploy1001 Started deploy [eventstreams/deploy@690fdad]: Wait for the client to consume the meesage being sent before consuming the next one - T199813 [18:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:05] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 [18:32:26] (03CR) 10Dzahn: [C: 032] Remove priyankaivy.blogspot.com from Planet [puppet] - 10https://gerrit.wikimedia.org/r/444328 (owner: 10Amire80) [18:32:36] (03CR) 10Volans: [C: 031] "Thanks for the py3 migration and all the fixes! 
LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [18:34:18] !log mobrovac@deploy1001 Finished deploy [eventstreams/deploy@690fdad]: Wait for the client to consume the meesage being sent before consuming the next one - T199813 (duration: 02m 18s) [18:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:34] (03PS2) 10Dzahn: phabricator: Print IDs of projects of tasks assigned to disabled accounts [puppet] - 10https://gerrit.wikimedia.org/r/446367 (owner: 10Aklapper) [18:45:02] (03CR) 10Dzahn: "+----------------------------------------------+--------------------------------+--------------------------------+" [puppet] - 10https://gerrit.wikimedia.org/r/446367 (owner: 10Aklapper) [18:45:57] (03CR) 10Dzahn: [C: 032] phabricator: Print IDs of projects of tasks assigned to disabled accounts [puppet] - 10https://gerrit.wikimedia.org/r/446367 (owner: 10Aklapper) [18:46:36] (03PS2) 10Thiemo Kreuz (WMDE): Do not leak local $wgWBShared… variables to th eglobal scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 [18:46:48] (03CR) 10jerkins-bot: [V: 04-1] Do not leak local $wgWBShared… variables to th eglobal scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [18:46:55] (03PS4) 10Dzahn: dumps: add phab1002 as second phab server [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) [18:49:43] (03Abandoned) 10Dzahn: convert check_prometheus_metric.py to python3 [puppet] - 10https://gerrit.wikimedia.org/r/441208 (owner: 10Dzahn) [18:59:34] (03CR) 10Gehel: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T1900) [19:00:32] (03CR) 10Gehel: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/445320 
(owner: 10EBernhardson) [19:00:38] (03CR) 10Gehel: [C: 031] Delete unused code in elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/445320 (owner: 10EBernhardson) [19:05:11] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [19:09:16] !log resetting postgres data on maps1003 after failing replication - T200228 [19:09:18] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 52360 MB (10% inode=99%) [19:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:20] T200228: disk space alert on maps1001 - https://phabricator.wikimedia.org/T200228 [19:11:28] PROBLEM - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [19:11:29] ACKNOWLEDGEMENT - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T200287 [19:11:29] RECOVERY - Device not healthy -SMART- on db1069 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [19:13:45] (03PS3) 10EBernhardson: Delete unused code in elasticsearch module [puppet] - 10https://gerrit.wikimedia.org/r/445320 [19:23:07] 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T200287 (10Cmjohnson) Swapped disk current state is rebuild Firmware state: Rebuild Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware s... 
[19:24:37] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10Cmjohnson) [19:25:51] (03PS3) 10EBernhardson: Split elasticsearch::log::hot_threads into two pieces [puppet] - 10https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) [19:25:53] (03PS2) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [19:32:02] (03CR) 10EBernhardson: "elasticsearch 5 is the default, so the only thing with elasticsearch 2 would have to have hiera config specifying it. Given that, there is" [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [19:46:48] RECOVERY - Disk space on elastic1024 is OK: DISK OK [19:47:58] RECOVERY - MegaRAID on db1069 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:48:54] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T200287 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This is all good now Thank you! ``` root@db1069:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... 
[19:49:44] 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) 05Open>03Resolved The disk got replaced and this is all good now: T200287#4448846 [19:51:32] (03PS3) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [19:58:15] (03PS2) 10Urbanecm: Add wikimania2019wiki Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/445766 (https://phabricator.wikimedia.org/T199509) [19:59:44] (03PS4) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [19:59:53] (03Abandoned) 10Urbanecm: Add wikimania2019wiki Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/445766 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [20:04:08] (03PS5) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [20:05:17] Reedy, Dereckson: We have 4 wikis pending. Can somebody clear the list? [20:05:36] I'm waiting for the wikimania.wikimedia.org to be handled in apache [20:05:42] (03PS2) 10Urbanecm: Initial configuration for wikimania2019wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) [20:07:00] ok [20:07:21] BTW, why you want to "Stop redirecting wikimania.wikimedia.org to the yearly wiki"? I don't understand that Reedy [20:07:34] Because we're going to put a wiki there [20:07:57] Oh, totally forgot... [20:08:00] Thanks [20:08:37] BTW, I've finally uploaded logos to the MW patch for wikimania2019wiki, so it is ready from this side as well. 
[20:08:41] (03PS6) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [20:11:22] (03PS1) 10Andrew Bogott: wmcs region-migrate: add ssh tests before and after [puppet] - 10https://gerrit.wikimedia.org/r/447719 [20:12:34] (03PS2) 10Andrew Bogott: wmcs region-migrate: add ssh tests before and after [puppet] - 10https://gerrit.wikimedia.org/r/447719 [20:13:13] (03CR) 10Andrew Bogott: [C: 032] wmcs region-migrate: add ssh tests before and after [puppet] - 10https://gerrit.wikimedia.org/r/447719 (owner: 10Andrew Bogott) [20:15:09] 10Operations, 10Analytics-Kanban, 10DNS, 10Release-Engineering-Team, and 5 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10JAllemandou) [20:17:00] (03PS7) 10EBernhardson: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) [20:17:02] (03PS2) 10EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [20:23:34] (03PS3) 10EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [20:23:36] (03PS8) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [20:27:24] (03PS4) 10EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [20:27:26] (03PS9) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [20:30:08] (03PS1) 10Ayounsi: Depool eqsin for cr1-eqsin software 
upgrade [dns] - 10https://gerrit.wikimedia.org/r/447721 [20:31:34] (03CR) 10Ayounsi: [C: 032] Depool eqsin for cr1-eqsin software upgrade [dns] - 10https://gerrit.wikimedia.org/r/447721 (owner: 10Ayounsi) [20:32:58] !log depooling eqsin for cr1-eqsin software upgrade [20:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:16] (03PS5) 10EBernhardson: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) [20:42:28] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [20:43:03] ^ expected [20:46:34] and confirmed that this alert is working as expected [20:56:29] I downtimed everything I could find with a eqsin mention in icinga [21:00:20] !log restarting cr1-eqsin for software upgrade [21:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:49] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:02:32] ^ that's labs-recursor1.wikimedia.org, I'd guess unrelated to cr1-eqsin maintenance [21:02:39] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:39] PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:39] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:39] PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:59] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:59] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:02:59] PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:03:19] hum, mgmt doesn't have mr1 as parent I guess [21:04:08] RECOVERY - Recursive DNS on 208.80.154.20 is OK: 
DNS OK: 0.046 seconds response time. www.wikipedia.org returns 208.80.154.224 [21:04:09] PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:04:18] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:04:18] PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:04:38] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:04:38] PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:04:38] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:05:08] PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:05:08] PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:05:39] PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:05:39] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:06:50] !log Install done, cr1-eqsin re-rebooting [21:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:58] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:03] 04Critical Alert for device cr1-eqsin.wikimedia.org - Critical syslog messages [21:08:08] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:08] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:08] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 
[21:08:08] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:08] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:08] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:08] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:09] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:09] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:18] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:18] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:18] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:18] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, 
cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:19] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:19] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:19] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:20] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:28] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:28] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:29] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:29] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:39] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:39] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 
42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:39] IPsec alerts expected [21:08:48] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:48] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:48] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:48] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [21:08:49] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:49] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:58] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6 [21:08:58] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: (unnamed) not-conn: cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v6, cp5011_v6, cp5012_v6 [21:08:59] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 61 connecting: (unnamed) not-conn: cp5001_v6, cp5002_v6, cp5003_v6, cp5004_v6, cp5005_v6 [21:08:59] PROBLEM - IPsec on 
cp1054 is CRITICAL: Strongswan CRITICAL - ok: 48 connecting: (unnamed) not-conn: cp5007_v6, cp5008_v6, cp5009_v6, cp5010_v6, cp5011_v6, cp5012_v6 [21:09:18] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 54 ESP OK [21:09:18] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 54 ESP OK [21:09:18] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 54 ESP OK [21:09:19] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 54 ESP OK [21:09:19] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 54 ESP OK [21:09:19] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 54 ESP OK [21:09:19] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 68 ESP OK [21:09:28] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [21:09:28] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 68 ESP OK [21:09:29] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 239.19 ms [21:09:29] RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 241.44 ms [21:09:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK [21:09:29] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 54 ESP OK [21:09:29] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 68 ESP OK [21:09:29] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 68 ESP OK [21:09:30] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 68 ESP OK [21:09:38] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 68 ESP OK [21:09:38] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [21:09:38] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 68 ESP OK [21:09:39] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 68 ESP OK [21:09:39] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 54 ESP OK [21:09:48] RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.85 ms [21:09:48] RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 234.91 ms [21:09:48] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK [21:09:48] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 54 ESP OK 
[21:09:58] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 54 ESP OK [21:09:58] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 54 ESP OK [21:09:58] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 247.92 ms [21:09:59] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK [21:09:59] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 54 ESP OK [21:09:59] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [21:09:59] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 54 ESP OK [21:09:59] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 240.75 ms [21:09:59] RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.73 ms [21:10:00] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.65 ms [21:10:06] !log starting to see recoveries from cr1-eqsin upgrade [21:10:08] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [21:10:08] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [21:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:09] RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.59 ms [21:10:09] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[21:10:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [21:10:18] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 54 ESP OK [21:10:18] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [21:10:18] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 54 ESP OK [21:10:19] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK [21:10:29] RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 228.58 ms [21:10:59] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 239.58 ms [21:12:32] (03PS3) 10Bstorm: gridengine: try to translate all the Ubuntu package calls to Debian [puppet] - 10https://gerrit.wikimedia.org/r/447561 (https://phabricator.wikimedia.org/T199276) [21:13:05] 08Warning Alert for device mr1-eqsin.wikimedia.org - Inbound interface errors [21:13:08] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:28] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.35 ms [21:13:28] RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 246.35 ms [21:13:28] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.44 ms [21:13:28] RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 253.01 ms [21:13:48] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 253.14 ms [21:13:48] RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 252.28 ms [21:14:19] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:15:25] !log re1 is master routing engine on cr1-eqsin, triggering a re switch [21:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:03] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqsin.wikimedia.org recovered from Critical syslog messages [21:20:16] (03CR) 10Bstorm: [C: 032] gridengine: try to translate all the Ubuntu package calls to Debian [puppet] - 10https://gerrit.wikimedia.org/r/447561 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [21:21:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:23:39] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:25:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:25:48] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:26:06] 08̶W̶a̶r̶n̶i̶n̶g Device mr1-eqsin.wikimedia.org recovered from Inbound interface errors [21:28:47] (03CR) 10Krinkle: [C: 031] JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [21:31:49] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 211 probes of 327 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [21:33:59] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:34:58] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:35:09] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_pybal] [21:39:37] rope-atlas probe is getting better [21:39:48] ripe* [21:41:18] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:42:08] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 327 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [21:48:35] (03PS1) 10Ayounsi: Revert "Depool eqsin for cr1-eqsin software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/447726 [21:49:03] (03CR) 10Ayounsi: [C: 032] Revert "Depool eqsin for cr1-eqsin software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/447726 (owner: 10Ayounsi) [21:50:38] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:59:42] !log re-pooling eqsin [21:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:56] (03PS1) 10Bstorm: gridengine: some more exec node package cleanup for stretch 
[puppet] - 10https://gerrit.wikimedia.org/r/447727 (https://phabricator.wikimedia.org/T199276) [22:02:38] (03CR) 10Bstorm: [C: 032] gridengine: some more exec node package cleanup for stretch [puppet] - 10https://gerrit.wikimedia.org/r/447727 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [22:15:28] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [22:15:30] (03PS1) 10Bstorm: gridengine: just a couple more changes to work with stretch [puppet] - 10https://gerrit.wikimedia.org/r/447729 (https://phabricator.wikimedia.org/T199276) [22:17:23] (03CR) 10Bstorm: [C: 032] gridengine: just a couple more changes to work with stretch [puppet] - 10https://gerrit.wikimedia.org/r/447729 (https://phabricator.wikimedia.org/T199276) (owner: 10Bstorm) [22:21:32] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [22:22:34] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [22:23:13] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) Creating a separate task presenting our questions as an RFC: {T200297} [22:29:58] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [22:39:19] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [22:41:02] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180724T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:01:58] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Sakretsu) For the record, IP and registered users are still reporting this issue from mobi... [23:02:15] (03PS1) 10RobH: decom prod dns for [dataset|ms]1001 [dns] - 10https://gerrit.wikimedia.org/r/447732 (https://phabricator.wikimedia.org/T194060) [23:05:06] (03PS1) 10RobH: decom dataset1001 & ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/447733 (https://phabricator.wikimedia.org/T194060) [23:12:18] 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission dataset1001, ms1001 - https://phabricator.wikimedia.org/T194060 (10RobH) [23:13:06] (03PS1) 10Dzahn: planet: fix broken URL in xmldescription, missing dot [puppet] - 10https://gerrit.wikimedia.org/r/447736 (https://phabricator.wikimedia.org/T198680) [23:14:14] 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission dataset1001, ms1001 - https://phabricator.wikimedia.org/T194060 (10RobH) [23:14:30] (03CR) 10RobH: [C: 032] decom dataset1001 & ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/447733 (https://phabricator.wikimedia.org/T194060) (owner: 10RobH) [23:14:59] (03CR) 10RobH: [C: 032] decom prod dns 
for [dataset|ms]1001 [dns] - 10https://gerrit.wikimedia.org/r/447732 (https://phabricator.wikimedia.org/T194060) (owner: 10RobH) [23:17:22] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) [23:17:33] (03PS1) 10Smalyshev: Enable constraints fetching for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/447740 (https://phabricator.wikimedia.org/T192567) [23:17:34] 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission dataset1001, ms1001 - https://phabricator.wikimedia.org/T194060 (10RobH) a:03Cmjohnson [23:17:36] (03PS1) 10Smalyshev: Enable constraints fetching on internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/447741 (https://phabricator.wikimedia.org/T192567) [23:17:38] (03PS1) 10Smalyshev: Enable constraints loading everywhere [puppet] - 10https://gerrit.wikimedia.org/r/447742 (https://phabricator.wikimedia.org/T192567) [23:18:12] (03PS2) 10Dzahn: planet: fix broken URL in xmldescription, missing dot [puppet] - 10https://gerrit.wikimedia.org/r/447736 (https://phabricator.wikimedia.org/T198680) [23:18:29] (03PS2) 10Smalyshev: Enable constraints fetching for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/447740 (https://phabricator.wikimedia.org/T192567) [23:19:36] (03CR) 10Paladox: [C: 031] planet: fix broken URL in xmldescription, missing dot [puppet] - 10https://gerrit.wikimedia.org/r/447736 (https://phabricator.wikimedia.org/T198680) (owner: 10Dzahn) [23:20:06] (03CR) 10Dzahn: [C: 032] planet: fix broken URL in xmldescription, missing dot [puppet] - 10https://gerrit.wikimedia.org/r/447736 (https://phabricator.wikimedia.org/T198680) (owner: 10Dzahn) [23:26:00] (03CR) 10Legoktm: [C: 032] Add .gitreview [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/447625 (owner: 10Zhuyifei1999) [23:26:23] (03Merged) 10jenkins-bot: Add .gitreview 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/447625 (owner: 10Zhuyifei1999) [23:38:24] (03PS10) 10EBernhardson: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) [23:38:26] (03PS2) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [23:41:53] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10Dzahn) Deprecated link on German Wikipedia. https://de.wikipedia.org/w/index.php?title=Wikipedia%3ATechnik%2FNetzwerk%2FDomains&type=revision&diff=1794... [23:43:35] (03PS1) 10Dzahn: planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 [23:44:04] (03CR) 10jerkins-bot: [V: 04-1] planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 (owner: 10Dzahn) [23:47:51] (03PS2) 10Dzahn: planet: tune feed name, description, owneremail, maxarticles [puppet] - 10https://gerrit.wikimedia.org/r/447743 [23:48:57] (03PS3) 10EBernhardson: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) [23:49:36] (03PS21) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) [23:57:52] (03CR) 10EBernhardson: "I've split most of the other parts out of this patch, leaving only conversion to systemd/elasticsearch_5@ and minor elasticsearch.yml conf" [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson)