[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T0000).
[00:00:36] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 1.728e+05 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[00:15:46] (CR) Alex Monk: labs-ip-alias-dump: Update to work with pdns-recursor v4.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) (owner: Andrew Bogott)
[00:24:41] (CR) Andrew Bogott: labs-ip-alias-dump: Update to work with pdns-recursor v4.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) (owner: Andrew Bogott)
[00:36:44] (PS1) BBlack: logstash-syslog-udp: use one-packet-scheduler [puppet] - https://gerrit.wikimedia.org/r/449913
[00:58:18] Operations, CommRel-Specialists-Support (Jul-Sep-2018), User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (Johan)
[02:19:56] Operations, Wikidata, Wikidata-Query-Service: Lost access to archiva - https://phabricator.wikimedia.org/T200954 (Legoktm)
[02:27:27] Operations, Cloud-VPS, cloud-services-team: labvirt1009 has high CPU, disk I/O and skyrocketted load - https://phabricator.wikimedia.org/T200888 (Legoktm) {F24420622} It looks much better now.
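The OSM synchronization lag alert above reports replication lag in seconds and compares it against fixed thresholds (the later recovery message shows both: warning 9e+04, critical 1.728e+05). A quick unit sanity check, using only values from the log:

```shell
# OSM lag alert thresholds from the log, in seconds; converted to hours
# they are round numbers, which is why these particular values were chosen.
warn=90000    # (W)9e+04
crit=172800   # (C)1.728e+05
echo "warning: $((warn / 3600))h"   # → warning: 25h
echo "critical: $((crit / 3600))h"  # → critical: 48h
```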
[02:36:58] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.14) (duration: 15m 42s)
[02:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:47] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 9463 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[02:48:32] (PS2) Tim Starling: Enable MCR migration stage "write both, read old" (the default) on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449649 (https://phabricator.wikimedia.org/T197816)
[02:48:42] (CR) Tim Starling: [C: 2] Enable MCR migration stage "write both, read old" (the default) on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449649 (https://phabricator.wikimedia.org/T197816) (owner: Tim Starling)
[02:50:12] (Merged) jenkins-bot: Enable MCR migration stage "write both, read old" (the default) on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449649 (https://phabricator.wikimedia.org/T197816) (owner: Tim Starling)
[02:53:34] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable MCR write-both mode on all wikis (duration: 00m 50s)
[02:54:35] (CR) jenkins-bot: Enable MCR migration stage "write both, read old" (the default) on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/449649 (https://phabricator.wikimedia.org/T197816) (owner: Tim Starling)
[03:08:30] (PS1) Legoktm: apt_install: Allow newline separated list of packages [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/449918
[03:10:52] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.15) (duration: 14m 51s)
[03:21:20] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Aug 2 03:21:20 UTC 2018 (duration 10m 29s)
[03:53:09] Bug report: "Cannot save edits at Wikitech-wiki - exception - Wikimedia\Rdbms\DBQueryError" https://phabricator.wikimedia.org/T200963
[03:58:43] TimStarling: do you think ^ is related to what you just deployed? the exception message is about the slots table: https://phabricator.wikimedia.org/T200963#4471004
[04:48:33] !log Deploy schema change on db1071 (s8 primary master) T144010 T51190 T199368
[04:54:18] legoktm: looking
[04:59:14] Operations, DBA, JADE, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Marostegui) >>! In T200297#4466074, @awight wrote: >>>! In T200297#4464608, @Marostegui wrote: >> What does: "our r...
[05:07:10] !log ran patch-slot-origin.sql on labswiki and labtestwiki
[05:08:02] "PHP fatal error:
[05:08:02] entire web request took longer than 60 seconds and timed out"
[05:08:02] aw
[05:08:16] where?
[05:08:16] For a private CheckUser action btw
[05:09:09] got to restart stashbot
[05:09:12] Seems to be okay now.
[05:09:23] labswiki edits seem to work now
[05:13:48] boop
[05:14:14] I just killed it too quickly, I assumed it was broken when actually it is just slow (containerised)
[05:14:30] it takes a minute or two to get into the channel after you restart
[05:14:36] !log restarted stashbot
[05:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:43] !log on mwmaint1001 running populateContentTables.php as described in T183488
[05:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:47] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[05:39:55] (PS1) Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/449927
[05:44:52] (PS2) Muehlenhoff: Validate SSH keys in account cross check [puppet] - https://gerrit.wikimedia.org/r/420810 (https://phabricator.wikimedia.org/T189890)
[05:54:20] Operations, Patch-For-Review: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (MoritzMuehlenhoff) Resolved>Open @Capt_Swing You're now using the same SSH key in WMCS as you do in the production network. This is a security risk since WMCS allows SSH...
[05:57:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/449927 (owner: Marostegui)
[05:59:09] (Merged) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/449927 (owner: Marostegui)
[06:00:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1119 (duration: 00m 57s)
[06:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:10] (PS3) Muehlenhoff: Validate SSH keys in account cross check [puppet] - https://gerrit.wikimedia.org/r/420810 (https://phabricator.wikimedia.org/T189890)
[06:05:37] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/449930
[06:08:52] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/449930 (owner: Marostegui)
[06:09:32] (CR) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/449927 (owner: Marostegui)
[06:10:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/449930
[06:10:21] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/449930 (owner: Marostegui)
[06:11:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1119 (duration: 00m 55s)
[06:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:43] (PS1) Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/449932
[06:15:24] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/449932 (owner: Marostegui)
[06:16:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/449932 (owner: Marostegui)
[06:17:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 (duration: 00m 54s)
[06:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:11] !log Deploy schema change on db1106 with replication, this will generate lag on labsdb:s1 T144010 T51190 T199368
[06:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:17] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[06:18:18] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[06:18:18] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[06:25:32] (CR) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/449932 (owner: Marostegui)
[06:27:34] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/449934
[06:30:36] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/varnishmtail-backend/varnishbackend.mtail]
[06:31:07] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ferm]
[06:31:34] mmh
[06:31:37] Could not evaluate: Could not retrieve file metadata for puppet:///modules/mtail/programs/varnishbackend.mtail: Error 500 on SERVER:
[06:32:06] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt]
[06:32:38] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/449934 (owner: Marostegui)
[06:33:16] oh that's the usual 8:30 CEST puppet issue
[06:33:54] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/449934 (owner: Marostegui)
[06:35:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1106 (duration: 00m 55s)
[06:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/449934 (owner: Marostegui)
[06:43:45] ema: what happen at 8:30?
[06:44:21] volans: some cron job makes apache temporarily sad IIRC, logrotate or similar?
[06:46:11] (PS3) Jcrespo: mariadb: Depool db1092 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/449697
[06:47:41] ema: yeah cron.daily would be at 6:25 UTC, I can have a quick look
[06:48:55] (CR) Jcrespo: [C: 2] mariadb: Depool db1092 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/449697 (owner: Jcrespo)
[06:50:08] (Merged) jenkins-bot: mariadb: Depool db1092 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/449697 (owner: Jcrespo)
[06:53:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 55s)
[06:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:07] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:21] !log stop db1092 for reimage
[06:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:03] (CR) jenkins-bot: mariadb: Depool db1092 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/449697 (owner: Jcrespo)
[06:59:25] !log on logstash1007 restarting logstash
[06:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:59] !log on logstash1007 increased net.core.rmem_default from 212992 to 851968 in soft state and restarted logstash
[07:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:17] (PS2) Ema: cache::text: enable websocket_support [puppet] - https://gerrit.wikimedia.org/r/449752 (https://phabricator.wikimedia.org/T164609)
[07:12:01] (CR) Ema: [C: 2] cache::text: enable websocket_support [puppet] - https://gerrit.wikimedia.org/r/449752 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[07:13:48] Operations, Wikidata, Wikidata-Query-Service: Lost access to archiva - https://phabricator.wikimedia.org/T200954 (elukey) p:Unbreak!>Normal Hello! The password changed due to an accidental leak to gerrit. We are working on a long term solution that will allow us to use LDAP with Archiva to au...
[07:16:57] Operations, Wikidata, Wikidata-Query-Service: Lost access to archiva - https://phabricator.wikimedia.org/T200954 (elukey) Open>Resolved a:elukey There you go: ``` elukey@stat1005:/home/smalyshev$ ls -l archiva.txt -rw------- 1 smalyshev root 40 Aug 2 07:16 archiva.txt ``` Please re-ope...
[07:20:32] !log restarting logstash on logstash1008 and logstash1009
[07:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:53] (PS1) Jcrespo: Revert "mariadb: Depool db1092 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/449938
[07:24:42] AaronSchulz: good catch, mind opening a task for missing "add" memcached ops?
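The net.core.rmem_default bump logged above ("in soft state", i.e. applied at runtime rather than persisted) quadruples the kernel's default UDP receive buffer so logstash drops fewer packets during bursts. A hedged sketch of what that change involves; the log records only the old and new values, not the exact commands, so the commands and the /etc/sysctl.d path are assumptions:

```shell
# Runtime ("soft state") change, requires root and is lost on reboot:
#   sysctl -w net.core.rmem_default=851968
# A persistent variant would go in a sysctl.d fragment (path illustrative):
#   /etc/sysctl.d/70-udp-buffers.conf:  net.core.rmem_default = 851968
# The new default is exactly 4x the old one:
old=212992
new=851968
echo $((new / old))   # → 4
```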
[07:27:30] (CR) Gehel: [C: 2] Set up cron task to regen low-zoom vector tiles [puppet] - https://gerrit.wikimedia.org/r/449719 (https://phabricator.wikimedia.org/T194787) (owner: MSantos)
[07:27:37] (PS1) WMDE-Fisch: Enable moved paragraph detection for inline diffs on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/449939 (https://phabricator.wikimedia.org/T200975)
[07:28:09] (PS4) Gehel: Set up cron task to regen low-zoom vector tiles [puppet] - https://gerrit.wikimedia.org/r/449719 (https://phabricator.wikimedia.org/T194787) (owner: MSantos)
[07:28:32] (PS2) Jcrespo: Revert partially "mariadb: Depool db1092 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/449938
[07:31:40] (CR) Jcrespo: [C: 2] Revert partially "mariadb: Depool db1092 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/449938 (owner: Jcrespo)
[07:32:55] (Merged) jenkins-bot: Revert partially "mariadb: Depool db1092 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/449938 (owner: Jcrespo)
[07:34:51] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 with low load (duration: 00m 56s)
[07:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:37] (PS1) Jcrespo: mariadb: Repool db1092 fully after warmup [mediawiki-config] - https://gerrit.wikimedia.org/r/449943
[07:41:08] (PS1) Ema: cache_text: listen for cache_misc PURGE multicasts [puppet] - https://gerrit.wikimedia.org/r/449945 (https://phabricator.wikimedia.org/T164609)
[07:44:36] (CR) Ema: [C: 2] cache_text: listen for cache_misc PURGE multicasts [puppet] - https://gerrit.wikimedia.org/r/449945 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[07:47:29] Operations, Traffic, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (ema)
[07:48:22] (CR) jenkins-bot: Revert partially "mariadb: Depool db1092 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/449938 (owner: Jcrespo)
[07:48:39] (PS8) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[07:49:42] (PS3) Gehel: Fix integration tests setup. [software/cumin] - https://gerrit.wikimedia.org/r/449224
[07:51:53] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[07:52:56] (CR) jerkins-bot: [V: -1] Fix integration tests setup. [software/cumin] - https://gerrit.wikimedia.org/r/449224 (owner: Gehel)
[08:00:21] jouncebot: now
[08:00:21] No deployments scheduled for the next 2 hour(s) and 59 minute(s)
[08:00:23] jouncebot: next
[08:00:23] In 2 hour(s) and 59 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1100)
[08:00:30] (PS1) Elukey: profile::archiva: depend on default-jdk [puppet] - https://gerrit.wikimedia.org/r/449950 (https://phabricator.wikimedia.org/T192639)
[08:01:34] (CR) Elukey: [C: 2] profile::archiva: depend on default-jdk [puppet] - https://gerrit.wikimedia.org/r/449950 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[08:03:30] (CR) Muehlenhoff: "It's probably also worth investigating whether this really needs the JDK package or whether default-jre or default-jre-headless is suffici" [puppet] - https://gerrit.wikimedia.org/r/449950 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[08:06:40] (CR) Gehel: [C: 1] "LGTM" [software/cumin] - https://gerrit.wikimedia.org/r/449519 (owner: Volans)
[08:06:59] moritzm: this time I said to myself "I know the right dependency so I won't bother Moritz!" and of course I failed :D
[08:08:18] I think that default-jre might suffice (even for zookeeper)
[08:08:44] error: cannot update the ref 'refs/remotes/origin/REL1_31': unable to append to '.git/logs/refs/remotes/origin/REL1_31': Permission denied
[08:08:44] ! 43ee1abaca..568bd6d1bf REL1_31 -> origin/REL1_31 (unable to update local ref)
[08:08:46] this is new
[08:09:13] -rw-r--r-- 1 tgr wikidev 160 Jul 30 11:24 .git/logs/refs/remotes/origin/REL1_31
[08:10:49] elukey: it's just a thought/consideration, take it with a grain of salt :-) but given that we're now setting up a new setup anyway, seems like a good time to revisit
[08:11:08] yep yep I was joking :D
[08:11:20] but it is a good suggestion, default-jre might be the best one
[08:11:42] !log legoktm@deploy1001 Started scap: LST: Use modern i18n mechanisms for localization (T198173, T200960)
[08:11:48] legoktm: ran chmod -R on it
[08:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:50] T198173: [LabeledSectionTransclusion] Use of LanguageGetMagic hook (used in LabeledSectionTransclusion::setupMagic) was deprecated in MediaWiki 1.16 - https://phabricator.wikimedia.org/T198173
[08:11:50] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960
[08:12:01] tgr: thanks
[08:12:30] (CR) Volans: [C: 2] Fix prospector tests [software/cumin] - https://gerrit.wikimedia.org/r/449519 (owner: Volans)
[08:12:46] T200690 has some discussion on how to prevent
[08:12:47] T200690: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690
[08:14:50] !log Deploy schema change on db2043 (s3 codfw master) this will generate lag on codfw:s3 T144010 T51190 T199368
[08:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:57] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[08:14:57] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
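The "Permission denied" on .git/logs/refs above is the failure mode T200690 describes: a file created earlier under a too-restrictive umask ends up -rw-r--r--, so the next deployer in the shared wikidev group cannot append to it. A minimal demonstration of how the umask determines the group-write bit, using throwaway paths rather than the real deploy tree (the `chmod -R g+w` at the end is my reading of the "ran chmod -R on it" fix in the log, which does not spell out the mode):

```shell
tmp=$(mktemp -d)
( umask 002; touch "$tmp/good" )   # group-writable: the next deployer can append
( umask 022; touch "$tmp/bad" )    # no group write: "Permission denied" later
stat -c '%a' "$tmp/good"           # prints 664
stat -c '%a' "$tmp/bad"            # prints 644
chmod -R g+w "$tmp"                # one-off repair of the existing files
stat -c '%a' "$tmp/bad"            # now 664
rm -rf "$tmp"
```

New files are created with mode 0666 & ~umask, which is why a screen session inheriting umask 022 quietly produces group-read-only files.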
[08:14:58] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[08:15:48] (Merged) jenkins-bot: Fix prospector tests [software/cumin] - https://gerrit.wikimedia.org/r/449519 (owner: Volans)
[08:16:58] (PS1) Filippo Giunchedi: logstash: alert on udp packet loss [puppet] - https://gerrit.wikimedia.org/r/449958 (https://phabricator.wikimedia.org/T200960)
[08:17:07] (CR) jenkins-bot: Fix prospector tests [software/cumin] - https://gerrit.wikimedia.org/r/449519 (owner: Volans)
[08:22:07] jouncebot: now
[08:22:07] No deployments scheduled for the next 2 hour(s) and 37 minute(s)
[08:22:09] jouncebot: next
[08:22:09] In 2 hour(s) and 37 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1100)
[08:24:44] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp5008.eqsin.wmnet', 'cp5002.eqsin.wmnet'] ``` The log can be found in `/var/log/w...
[08:25:43] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11961/" [puppet] - https://gerrit.wikimedia.org/r/449958 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:25:52] (PS2) Filippo Giunchedi: logstash: alert on udp packet loss [puppet] - https://gerrit.wikimedia.org/r/449958 (https://phabricator.wikimedia.org/T200960)
[08:25:57] Reedy: I'm scapping atm
[08:26:06] legoktm: pfft
[08:26:37] get in line ;)
[08:26:56] !deploywindow
[08:27:08] I was gonna create some wikis or something
[08:27:46] (CR) Filippo Giunchedi: [C: 2] logstash: alert on udp packet loss [puppet] - https://gerrit.wikimedia.org/r/449958 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:28:39] (PS1) Volans: Remove unnecessary parentheses from class defs [software/cumin] - https://gerrit.wikimedia.org/r/449970
[08:34:45] (CR) Gehel: [C: 2] "LGTM, trivial enough" [software/cumin] - https://gerrit.wikimedia.org/r/449970 (owner: Volans)
[08:35:50] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 25 probes of 334 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[08:36:05] (CR) jenkins-bot: Remove unnecessary parentheses from class defs [software/cumin] - https://gerrit.wikimedia.org/r/449970 (owner: Volans)
[08:36:16] !log legoktm@deploy1001 Scap failed!: 11/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[08:36:16] !log legoktm@deploy1001 scap failed: RuntimeError Scap failed!: 11/11 canaries failed their endpoint checks(http://en.wikipedia.org) (duration: 24m 33s)
[08:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:39] uhhhhhh ffs
[08:36:41] 08:36:07 Check 'Check endpoints for mw1261.eqiad.wmnet' failed: /wiki/{title} (Main Page) is CRITICAL: Test Main Page returned the unexpected status 500 (expecting: 200); /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 500 (expecting: 200)
[08:37:40] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 885 bytes in 0.040 second response time
[08:37:42] womp womp
[08:38:00] PROBLEM - HHVM rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.069 second response time
[08:38:18] 2018-08-02 08:35:50 [W2LCZgpAADsAAJt3bCAAAAAV] mw1264 enwiki 1.32.0-wmf.14 exception ERROR: [W2LCZgpAADsAAJt3bCAAAAAV] /w/index.php?title=Charlotte_Bront%C3%AB&action=edit&section=12 MWException from line 164 of /srv/mediawiki/php-1.32.0-wmf.14/includes/Hooks.php: Invalid callback LabeledSectionTransclusion::setupMagic in hooks for LanguageGetMagic
[08:38:20] umm
[08:38:20] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.043 second response time
[08:38:20] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.042 second response time
[08:38:21] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.066 second response time
[08:38:29] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.041 second response time
[08:38:35] :/
[08:38:39] one moment
[08:38:39] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.046 second response time
[08:38:40] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 885 bytes in 0.063 second response time
[08:38:50] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.045 second response time
[08:38:50] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.046 second response time
[08:38:51] it didn't sync properly??
[08:38:59] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.038 second response time
[08:39:00] I'm just gonna revert
[08:39:05] Kinda looks like it :(
[08:39:19] (PS9) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[08:39:38] (CR) Gehel: Extract progress bars from clustershell event handling. (9 comments) [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[08:40:10] syncing
[08:40:17] but as far as I can tell, extension.json did not sync properly
[08:40:20] the patch is fine
[08:40:20] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76512 bytes in 0.122 second response time
[08:40:29] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 76512 bytes in 0.113 second response time
[08:40:29] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76512 bytes in 0.117 second response time
[08:40:30] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 76510 bytes in 0.096 second response time
[08:40:40] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.104 second response time
[08:40:40] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76529 bytes in 0.146 second response time
[08:40:41] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/LabeledSectionTransclusion/: revert (duration: 00m 56s)
[08:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:50] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 76510 bytes in 0.088 second response time
[08:40:50] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76524 bytes in 5.602 second response time
[08:40:59] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 76510 bytes in 0.083 second response time
[08:41:00] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 76512 bytes in 0.122 second response time
[08:41:09] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.103 second response time
[08:41:25] Operations, Performance-Team, Traffic, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Nemo_bis) Thank you all for the investigation. The amount of indexed URLs seems w...
[08:41:43] (PS1) Elukey: profile::archiva: move proxy settings to a different profile [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639)
[08:41:48] ok but now the l10n cache is built right? so theoretically a re-revert should be safe?
[08:42:02] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[08:43:21] Timestamps on deploy1001 look like they're built
[08:44:33] pulling on mwdebug1002
[08:45:09] (PS10) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[08:45:37] hm, I guess it didn't do the scap-cdb-rebuild step
[08:45:42] so it'll require another scap?
[08:46:00] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 334 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[08:46:34] I'm guessing it hadn't sync'd everywhere by then?
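The canary endpoint checks that failed above boil down to fetching a page on each canary host and comparing the HTTP status against an expected 200. A simplified sketch of that comparison; in practice the status would come from something like `curl -s -o /dev/null -w '%{http_code}' <url>`, and the function name and message wording here are illustrative, not scap's actual code:

```shell
# Classify an HTTP status against the expected one, mimicking the
# "returned the unexpected status 500 (expecting: 200)" messages above.
check_status() {
  local status=$1 expected=${2:-200}
  if [ "$status" = "$expected" ]; then
    echo "OK: returned $status"
  else
    echo "CRITICAL: returned the unexpected status $status (expecting: $expected)"
    return 1
  fi
}

check_status 200           # a healthy canary
check_status 500 || true   # what 11/11 failing canaries reported
```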
[08:46:35] (PS2) Elukey: profile::archiva: move proxy settings to a different profile [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639)
[08:47:45] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[08:49:54] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11963/ - seems fine from this pcc run.." [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[08:52:03] ok, it's in wmf-config/ExtensionMessages-1.32.0-wmf.14.php
[08:52:06] I'm gonna try scap once more
[08:52:17] mwdebug1002 testing was productive
[08:52:29] !log legoktm@deploy1001 Started scap: try once more
[08:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:16] Operations, Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (fgiunchedi)
[09:07:08] (PS3) Elukey: profile::archiva: move proxy settings to a different profile [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639)
[09:08:29] (PS4) Elukey: profile::archiva: move proxy settings to a different profile [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639)
[09:10:20] (PS11) Gehel: Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191
[09:11:38] (PS3) Giuseppe Lavagetto: mediawiki::web::site: backport changes from mediawiki_exp [puppet] - https://gerrit.wikimedia.org/r/449661
[09:12:24] (CR) Elukey: "new pcc: https://puppet-compiler.wmflabs.org/compiler02/11964/meitnerium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[09:12:43] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5002.eqsin.wmnet', 'cp5008.eqsin.wmnet'] ``` and were **ALL** successful.
[09:13:02] (CR) jerkins-bot: [V: -1] Extract progress bars from clustershell event handling. [software/cumin] - https://gerrit.wikimedia.org/r/449191 (owner: Gehel)
[09:14:37] !log legoktm@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[09:14:37] !log legoktm@deploy1001 scap failed: RuntimeError Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org) (duration: 22m 08s)
[09:14:39] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::site: backport changes from mediawiki_exp [puppet] - https://gerrit.wikimedia.org/r/449661 (owner: Giuseppe Lavagetto)
[09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:53] wtffffff
[09:15:41] legoktm: Has it checked out correctly?
[09:15:56] yes
[09:16:05] 2018-08-02 09:15:27 [W2LLrwpAAEoAAIQJBnEAAACK] mw1279 enwiki 1.32.0-wmf.14 exception ERROR: [W2LLrwpAAEoAAIQJBnEAAACK] /w/api.php MWException from line 355 of /srv/mediawiki/php-1.32.0-wmf.14/includes/MagicWord.php: Error: invalid magic word 'lst' {"exception_id":"W2LLrwpAAEoAAIQJBnEAAACK","exception_url":"/w/api.php","caught_by":"mwe_handler"}
[09:16:32] which means the l10n cache didn't build properly?
[09:16:40] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 885 bytes in 0.078 second response time
[09:16:41] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.041 second response time
[09:16:50] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.049 second response time
[09:17:00] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.049 second response time
[09:17:01] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.051 second response time
[09:17:17] reverting again...
[09:17:21] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.048 second response time [09:17:31] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.057 second response time [09:17:31] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.047 second response time [09:17:38] godog: I have no idea why it won't backport properly [09:17:40] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 880 bytes in 0.049 second response time [09:17:46] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10MoritzMuehlenhoff) Actually, let's use stretch. [09:17:50] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76520 bytes in 0.098 second response time [09:17:51] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.106 second response time [09:18:00] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.103 second response time [09:18:05] but I'm giving up now [09:18:08] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/LabeledSectionTransclusion/: revert again (duration: 00m 55s) [09:18:10] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.107 second response time [09:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:11] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.115 second response time [09:18:18] legoktm: It's definitely there in l10n_cache-en.cdb.json [09:18:31] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.108 second response time [09:18:40] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 
76510 bytes in 0.085 second response time [09:18:41] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 76511 bytes in 0.109 second response time [09:18:50] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 76510 bytes in 0.100 second response time [09:19:09] Reedy: do you want to give it a shot? it's 2am here and I'm probably not thinking straight [09:19:19] it's reverted locally on deploy1001 right now [09:19:21] I was just looking to check beta managed it [09:19:55] Reedy: it's in wmf.15, we were just trying to get it everywhere before the train rolls out in a few hours [09:19:59] <_joe_> brb [09:20:00] Looks like it should've couple of days ago [09:20:37] (03PS1) 10Volans: Upgrade Django and other dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/449979 [09:21:22] (03CR) 10jerkins-bot: [V: 04-1] Upgrade Django and other dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/449979 (owner: 10Volans) [09:22:34] (03CR) 10Volans: "CI failures expected as it runs on py34 only and it installs the dependencies also if it's specified not to." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/449979 (owner: 10Volans) [09:25:50] legoktm: no worries! thanks for trying it anyways late at night :)) [09:26:40] legoktm: Don't suppose it's being daft and not rebuilding the l10ncache because it thinks it's up to date (no json changes) [09:26:50] * Reedy pokes mwdebug [09:26:54] 09:23:58 Started scap-cdb-rebuild [09:27:34] Reedy: but didn't you say that it was in the cdb file? [09:27:41] it was on deploy1001 [09:27:45] But that doesn't mean it is elsewhere [09:27:55] 09:23:58 Started scap-cdb-rebuild [09:27:55] 09:27:47 09:27:47 Updated 412 CDB files(s) in /srv/mediawiki/php-1.32.0-wmf.14/cache/l10n [09:28:29] (03PS12) 10Gehel: Extract progress bars from clustershell event handling.
[software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [09:28:48] (03PS1) 10Filippo Giunchedi: prometheus: fix udp loss alert [puppet] - 10https://gerrit.wikimedia.org/r/449984 (https://phabricator.wikimedia.org/T200960) [09:29:10] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix udp loss alert [puppet] - 10https://gerrit.wikimedia.org/r/449984 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [09:29:16] 2018-08-02 09:15:27 [W2LLrwpAAEoAAIQJBnEAAACK] mw1279 enwiki 1.32.0-wmf.14 exception ERROR: [W2LLrwpAAEoAAIQJBnEAAACK] /w/api.php MWException from line 355 of /srv/mediawiki/php-1.32.0-wmf.14/includes/MagicWord.php: Error: invalid magic word 'lst' {"exception_id":"W2LLrwpAAEoAAIQJBnEAAACK","exception_url":"/w/api.php","caught_by":"mwe_handler"} [09:29:16] 2018-08-02 09:15:27 [W2LLrwpAAEoAAIQJBnEAAACK] mw1279 enwiki 1.32.0-wmf.14 exception ERROR: [W2LLrwpAAEoAAIQJBnEAAACK] /w/api.php BadMethodCallException from line 188 of /srv/mediawiki/php-1.32.0-wmf.14/extensions/TemplateStyles/includes/TemplateStylesHooks.php: Call to a member function clear() on a non-object (null) {"exception_id":"W2LLrwpAAEoAAIQJBnEAAACK","exception_url":"/w/api.php","caught_by":"mwe_handler"} [09:31:29] (03PS2) 10Filippo Giunchedi: prometheus: fix udp loss alert [puppet] - 10https://gerrit.wikimedia.org/r/449984 (https://phabricator.wikimedia.org/T200960) [09:32:28] > var_dump( MagicWord::get( 'lst' ) ); [09:32:28] object(MagicWord)#418 (11) { [09:32:28] ["mId"]=> [09:32:28] string(3) "lst" [09:32:47] legoktm: Race condition? [09:33:01] Syncing updated files before l10n is updated [09:33:06] So of course it's going to fail like this [09:33:08] that would be my guess [09:33:09] (03CR) 10Muehlenhoff: "Like the approach, but some comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:33:12] Ok... 
So [09:33:45] theoretically the patch could be split into two [09:33:50] add the magic.php file [09:33:52] That's what I'm just thinking [09:33:53] then drop the hook [09:33:54] Yeah, exactly [09:34:06] Add the magic file to json, the php [09:34:09] Leave the hook etc in place [09:34:17] Let me have a go [09:34:22] ok :) [09:34:38] tldr; letting the train deploy shit is easier ;P [09:34:50] Reedy: I don't know whether that much work is worth the 5 hours it'll be deployed for before the train puts wmf.15 everywhere [09:34:53] yeah, exactly [09:35:08] Reedy: https://phabricator.wikimedia.org/T200960 was the motivation [09:35:15] (03CR) 10Elukey: profile::archiva: move proxy settings to a different profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:35:25] Should only take a few mins for this.. [09:36:36] (03PS5) 10Elukey: profile::archiva: move proxy settings to a different profile [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) [09:37:19] gasp, I thought it'd be easier, thanks folks [09:37:49] godog: You must be new around here [09:37:55] legoktm: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/LabeledSectionTransclusion/+/449986/ ? 
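The race described above (code referencing a new magic word syncing to appservers before the rebuilt l10n cache does) suggests a quick pre-sync check: confirm the new key is present in the built l10n JSON before syncing any code that depends on it. A rough sketch, where a throwaway file and an illustrative key format stand in for the real /srv/mediawiki/php-1.32.0-wmf.14/cache/l10n/l10n_cache-en.cdb.json:

```shell
# Check that a magic word made it into a built l10n cache JSON file
# before syncing code that depends on it. Key format is illustrative.
has_magic_word() {
  grep -q "\"$2\"" "$1"
}

demo=$(mktemp)
printf '{"magicwords:lst":["0","section"]}\n' > "$demo"   # stand-in cache entry
if has_magic_word "$demo" "magicwords:lst"; then
  echo "lst present - safe to sync code using it"
else
  echo "lst missing - sync l10n first"
fi
rm -f "$demo"
```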
[09:38:06] (I'm happy to deploy, just want a quick sanity check) [09:38:26] (03CR) 10Muehlenhoff: profile::archiva: move proxy settings to a different profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:38:55] Reedy: lgtm [09:39:08] Let's see if jenkins barfs over it [09:39:37] (03CR) 10Muehlenhoff: [C: 031] "One nitpick, looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:40:14] (03CR) 10Elukey: profile::archiva: move proxy settings to a different profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:40:17] Of course, this just what happens when you're modifying "code" that's in use... [09:40:46] (03PS6) 10Elukey: profile::archiva: move proxy settings to a different profile [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) [09:41:29] (03PS7) 10Elukey: profile::archiva: move proxy settings to a different profile [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) [09:41:46] Reedy: ...we could have just backed out the hard deprecation from wmf.14 [09:41:49] (03CR) 10Volans: [C: 031] "LGTM! Thanks a lot!" [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [09:42:00] Haha [09:42:02] True [09:42:16] If this doesn't work, let's just do that [09:42:21] (03CR) 10Elukey: [C: 032] profile::archiva: move proxy settings to a different profile [puppet] - 10https://gerrit.wikimedia.org/r/449974 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [09:43:18] (03CR) 10Volans: [C: 031] "Missing rebase? 
LGTM otherwise" [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [09:43:31] Jenkins test failure due to parsertest [09:44:53] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/449979 (owner: 10Volans) [09:46:16] (03CR) 10Volans: [V: 032 C: 032] Upgrade Django and other dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/449979 (owner: 10Volans) [09:47:04] (03PS4) 10Gehel: Fix integration tests setup. [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 [09:47:20] (03PS1) 10Volans: Fix submodule directory [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449991 [09:47:22] (03PS1) 10Volans: Rebuild requirements to pick security upgrades [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449992 [09:47:24] (03PS1) 10Volans: Rebuild wheels with upgraded dependencies [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449993 [09:52:33] (03PS1) 10ArielGlenn: category diffs: full ts dir is different than dailies ts dir [puppet] - 10https://gerrit.wikimedia.org/r/449994 (https://phabricator.wikimedia.org/T198356) [09:53:47] (03CR) 10ArielGlenn: [C: 032] category diffs: full ts dir is different than dailies ts dir [puppet] - 10https://gerrit.wikimedia.org/r/449994 (https://phabricator.wikimedia.org/T198356) (owner: 10ArielGlenn) [09:54:46] Something is going on on s3 slaves: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1533201494698&to=1533203511882 [09:54:50] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 54.36, 38.46, 24.73 [09:57:23] (03PS9) 10Giuseppe Lavagetto: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [09:58:58] * Reedy dies of boredom waiting for jerkins [09:59:02] (03PS1) 10Volans: Updated src to v0.1.7 
[software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/449995 [09:59:04] (03PS1) 10Volans: Built wheels for v0.1.7 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/449996 [09:59:46] (03PS1) 10Elukey: archiva::proxy: include acme config only when needed [puppet] - 10https://gerrit.wikimedia.org/r/449997 (https://phabricator.wikimedia.org/T192639) [10:00:57] Reedy: Could https://grafana.wikimedia.org/dashboard/db/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1533201494698&to=1533203511882 be related to the deploy? [10:01:20] It was reverted [10:01:38] Those look to spike after the scap failed [10:02:20] I don't think this extension does any sql queries directly (other than potentially parser cache stuff) [10:03:11] it is slowly decreasing [10:03:52] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1533201494698&to=1533203511882 [10:04:17] that is happening in all replicas [10:05:29] :( [10:05:55] (03PS2) 10Elukey: archiva::proxy: include acme config only when needed [puppet] - 10https://gerrit.wikimedia.org/r/449997 (https://phabricator.wikimedia.org/T192639) [10:06:37] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [10:06:47] <_joe_> paravoid: ^^ [10:07:48] (03CR) 10Elukey: [C: 032] "No op https://puppet-compiler.wmflabs.org/compiler02/11968/" [puppet] - 10https://gerrit.wikimedia.org/r/449997 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey) [10:08:26] (03PS3) 10Elukey: archiva::proxy: include acme config only when needed [puppet] - 10https://gerrit.wikimedia.org/r/449997 (https://phabricator.wikimedia.org/T192639) [10:08:33] (03CR) 10Volans: [C: 032] Extract progress bars from clustershell event handling. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [10:10:33] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.14/languages/Language.php: Revert out deprecation warning of hook (duration: 00m 57s) [10:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:21] (03Merged) 10jenkins-bot: Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [10:11:46] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/LabeledSectionTransclusion/: Consistency (duration: 00m 55s) [10:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:37] (03CR) 10jenkins-bot: Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [10:13:09] (03CR) 10Volans: [C: 032] Fix integration tests setup. [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [10:13:33] legoktm: That should've shut it up [10:13:46] godog: ^^ [10:14:12] \o/ indeed, quite the drop in received messages [10:14:33] quite == 10x [10:14:41] bad James_F|Away [10:15:57] (03Merged) 10jenkins-bot: Fix integration tests setup. [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [10:16:41] thanks again, still some work to do on logstash side of course but the biggest offender is gone [10:17:09] (03CR) 10jenkins-bot: Fix integration tests setup. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [10:19:12] !log test bumping rmem_default to 4MB on logstash1007 - T200960 [10:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:16] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [10:19:24] (03PS9) 10Reedy: id_internalwikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 (owner: 10Urbanecm) [10:21:52] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 11.43, 14.25, 23.43 [10:22:11] (03CR) 10Reedy: [C: 032] id_internalwikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 (owner: 10Urbanecm) [10:23:30] (03Merged) 10jenkins-bot: id_internalwikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 (owner: 10Urbanecm) [10:24:59] !log volans@deploy1001 Started deploy [debmonitor/deploy@691d2f8]: Release v0.1.7 [10:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] godog: still not helping by the looks of it, despite the drop in demand [10:25:51] !log volans@deploy1001 Finished deploy [debmonitor/deploy@691d2f8]: Release v0.1.7 (duration: 00m 51s) [10:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:11] TimStarling: indeed, more manageable but still can't drain the receive buffer [10:27:28] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10mobrovac) I assume the task description implies the topic would get multiple messages every week... 
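The rmem_default bump logged above raises the kernel's default receive buffer for sockets that don't set their own size; a UDP listener that can't keep up starts dropping datagrams once that buffer fills. A hedged sketch of checking and applying the value (the 4 MB figure mirrors the test above; the sysctl write needs root and does not persist without an /etc/sysctl.d/ entry):

```shell
# Default receive buffer (bytes) the kernel gives new sockets.
# 4 MB = 4194304 bytes, the value tried on logstash1007 above.
target=$((4 * 1024 * 1024))
echo "target rmem_default: $target bytes"

# Read-only check works unprivileged on Linux:
#   sysctl -n net.core.rmem_default
# Applying the bump (root; non-persistent until written to /etc/sysctl.d/):
#   sysctl -w net.core.rmem_default=4194304
```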
[10:28:28] (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.7 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/449995 (owner: 10Volans) [10:28:41] (03CR) 10Volans: [V: 032 C: 032] Built wheels for v0.1.7 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/449996 (owner: 10Volans) [10:29:42] !log volans@deploy1001 Started deploy [debmonitor/deploy@73b640d]: Release v0.1.7 [10:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:49] Urbanecm: As always, creating wikis is broken [10:30:43] !log volans@deploy1001 Finished deploy [debmonitor/deploy@73b640d]: Release v0.1.7 (duration: 01m 01s) [10:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:59] godog: it's only receiving 1.5 MB/s per second according to prometheus, it would have to stall for 2.5 seconds to fill that buffer you gave it [10:35:11] !log volans@deploy1001 Started deploy [netbox/deploy@ac54feb]: Security upgrade of dependency [10:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:16] !log volans@deploy1001 Finished deploy [netbox/deploy@ac54feb]: Security upgrade of dependency (duration: 00m 05s) [10:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:35] (03CR) 10jenkins-bot: id_internalwikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279 (owner: 10Urbanecm) [10:35:45] (03PS1) 10Reedy: Revert "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450002 [10:38:08] jouncebot: next [10:38:08] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1100) [10:38:15] bleugh, guess I should clean up then [10:38:18] (03CR) 10Volans: [C: 04-2] "Before merging this also Puppet should be adapted and also ensure that a scap deploy will work, it might need some manual 
tweak." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449991 (owner: 10Volans) [10:38:19] TimStarling: heh, the buffer isn't full all the time though, sometimes it gets drained, I'm using watch -d netstat -an \| grep -i udp [10:38:30] (03CR) 10Reedy: [C: 032] "addWiki is broken. Can't get a resolution yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450002 (owner: 10Reedy) [10:38:32] (03PS2) 10Volans: Rebuild requirements to pick security upgrades [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449992 [10:38:44] (03PS2) 10Volans: Rebuild wheels with upgraded dependencies [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/449993 [10:38:52] we'll need to add jmx_exporter to logstash too to get some jvm stats into prometheus [10:39:55] (03Merged) 10jenkins-bot: Revert "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450002 (owner: 10Reedy) [10:40:37] (03PS1) 10Reedy: Re-instate "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450003 (https://phabricator.wikimedia.org/T196747) [10:40:41] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 913.49 seconds [10:40:52] TimStarling: FWIW the other thing I'm looking at is watch -n1 -d curl -s 'localhost:9600/_node/stats/jvm' \| jq . 
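TimStarling's back-of-envelope figure can be reproduced: at roughly 1.5 MB/s inbound, a 4 MB receive buffer only absorbs a stall of about two and a half seconds before the kernel starts dropping UDP packets. A sketch of that arithmetic, using the numbers quoted above:

```shell
# How long can logstash stall before a 4 MB UDP receive buffer overflows
# at the observed ~1.5 MB/s inbound rate?
awk 'BEGIN {
  buf  = 4 * 1024 * 1024      # rmem_default after the bump, bytes
  rate = 1.5 * 1024 * 1024    # observed inbound rate, bytes/sec
  printf "buffer absorbs ~%.1f s of stall\n", buf / rate
}'

# Live view of whether the buffer is draining (as used above);
# netstat's Recv-Q column shows bytes currently queued per socket:
#   watch -d 'netstat -an | grep -i udp'
```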
[10:41:22] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 960.55 seconds [10:42:32] but yeah unsurprisingly a ton of young objects [10:45:27] I'll try with some more heap [10:46:21] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [10:47:04] (03PS2) 10Giuseppe Lavagetto: mediawiki: copy all individual wiki templates over from mediawiki_test [puppet] - 10https://gerrit.wikimedia.org/r/449721 (https://phabricator.wikimedia.org/T196968) [10:47:33] !log test bumping heap to 512mb on logstash1007 - T200960 [10:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:37] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [10:48:03] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: copy all individual wiki templates over from mediawiki_test [puppet] - 10https://gerrit.wikimedia.org/r/449721 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [10:48:18] I'll write an update on the task, then I'm off for the night [10:49:15] sounds good, thanks TimStarling ! [10:52:15] (03CR) 10jenkins-bot: Revert "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450002 (owner: 10Reedy) [10:54:19] Reedy, what happened with creating wikis? [10:54:23] * Urbanecm is too lazy to read the history [10:54:28] https://phabricator.wikimedia.org/T200994 [10:59:20] Thx [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1100). [11:00:04] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] Reedy, were you able to create other wikis? [11:00:18] \o/ [11:00:20] I've not tried [11:00:21] I can SWAT today [11:00:26] There's no point if it's not working for one [11:01:32] wikimania2019wiki is a public wiki, id_internalwikimedia is a private one. It can differ per wiki type, in theory [11:01:46] CFisch_WMDE: I'll ping you when your commit is at mwdebug for testing, please stand by [11:02:12] zeljkof: ack [11:02:21] Urbanecm: it's the elasticsearch cluster timing out [11:02:23] So I doubt it [11:02:41] There's no variance on the cirrus indexing config for this part [11:03:08] Ok, you know probably more than I [11:03:12] !log comment "pipeline.workers: 1" from logstash1007 - T200960 [11:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:17] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [11:04:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449939 (https://phabricator.wikimedia.org/T200975) (owner: 10WMDE-Fisch) [11:05:02] [2018-08-02T11:02:55,692][WARN ][logstash.pipeline ] Defaulting pipeline worker threads to 1 because there are some filters that might not work with multiple worker threads {:count_was=>4, :filters=>["multiline"]} [11:05:09] Reedy, FYI, I'm going to write an update to the Wikimania task, as they asked when the wiki will be created. [11:05:47] (03Merged) 10jenkins-bot: Enable moved paragraph detection for inline diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449939 (https://phabricator.wikimedia.org/T200975) (owner: 10WMDE-Fisch) [11:08:14] CFisch_WMDE: 449939 is at mwdebug1002, but I'm not sure if you can test it there, maybe it's at labs already?
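The WARN line quoted above is Logstash protecting a thread-unsafe filter: the multiline filter must see events in order, so any pipeline containing it silently falls back to a single worker regardless of the configured count, which is why the filter had to be removed before multiple workers could take effect. The relevant knob, as it might appear in logstash.yml (the snippet is a sketch with the 4 workers mentioned above, not the production config):

```yaml
# logstash.yml (sketch) - ignored for any pipeline containing a
# non-thread-safe filter such as multiline, which forces workers to 1.
pipeline.workers: 4
pipeline.batch.size: 125   # Logstash default, shown for context
```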
[11:08:28] CFisch_WMDE: but please do test and let me know if I can deploy it [11:09:56] (03CR) 10jenkins-bot: Enable moved paragraph detection for inline diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449939 (https://phabricator.wikimedia.org/T200975) (owner: 10WMDE-Fisch) [11:10:09] zeljkof: yeah good question it should be fine to deploy it [11:10:41] CFisch_WMDE: ok, deploying [11:10:46] it's not really visible atm but I guess that's due to diff caching [11:11:43] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:449939|Enable moved paragraph detection for inline diffs on beta cluster (T200975)]] (duration: 00m 58s) [11:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:47] T200975: Enable inline moved paragraphs on the beta cluster - https://phabricator.wikimedia.org/T200975 [11:12:13] CFisch_WMDE: it's deployed, please test and thanks for deploying with #releng! :) [11:12:30] Thanks zeljkof ! 
:-) [11:12:51] !log EU SWAT finished [11:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:15] !log installing tomcat8 security updates [11:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:38] !log temporarily remove multiline filter from logstash to allow using multiple workers - T200960 [11:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:42] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [11:19:43] ok that's more like it, the receive buffer is getting drained [11:21:30] !log installing jansson security updates on trusty [11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:46] !log temporarily remove multiline filter from logstash100[789] and bump pipeline workers to 4 - T200960 [11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:50] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [11:23:59] (03PS2) 10Reedy: Re-instate "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450003 (https://phabricator.wikimedia.org/T196747) [11:24:22] (03CR) 10Reedy: [C: 032] Re-instate "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450003 (https://phabricator.wikimedia.org/T196747) (owner: 10Reedy) [11:25:02] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0 [11:25:39] (03Merged) 10jenkins-bot: Re-instate "id_internalwikimedia: Initial configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450003 (https://phabricator.wikimedia.org/T196747) (owner: 10Reedy) [11:25:51] (03CR) 10jenkins-bot: Re-instate "id_internalwikimedia: Initial configuration" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/450003 (https://phabricator.wikimedia.org/T196747) (owner: 10Reedy) [11:27:01] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active [11:28:15] FYI no pages are loading for wiki [11:28:24] No APIs either [11:28:44] cc herron [11:29:12] It's back now [11:29:15] (03CR) 10Reedy: [C: 04-1] "Needs langlist and interwiki sorting orders for this new language code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:29:23] (and of course as *soon* as I ping an op( [11:29:24] )* [11:29:39] I suspect it was just Oshwah discovering what happens when he blinks. [11:30:04] NotASpy: My robot drives were idle; had to spin those up. [11:31:01] (03CR) 10Reedy: [C: 04-1] "Needs updating to be wikimaniawiki not wikimania2019 wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [11:31:22] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 85 probes of 334 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [11:32:31] (03PS3) 10Reedy: Initial configuration for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445764 (https://phabricator.wikimedia.org/T199577) (owner: 10Urbanecm) [11:32:45] (03PS12) 10Muehlenhoff: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [11:33:05] (03CR) 10Reedy: [C: 032] Initial configuration for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445764 (https://phabricator.wikimedia.org/T199577) (owner: 10Urbanecm) [11:33:06] (03PS2) 10Filippo Giunchedi: Add sat to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/442867 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:33:45] (03CR) 10Filippo 
Giunchedi: [C: 032] Add sat to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/442867 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:34:20] (03Merged) 10jenkins-bot: Initial configuration for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445764 (https://phabricator.wikimedia.org/T199577) (owner: 10Urbanecm) [11:35:17] Reedy, you're trying the wikis again? [11:35:22] Yes [11:35:37] I got id_internal done eventually [11:35:38] Ok, I'm going to solve your CR-1 and other issues [11:35:40] Gonna do the rest differently [11:36:31] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 7 probes of 334 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [11:36:53] I see still No wiki found when I browse to https://id-internal.wikimedia.org [11:37:11] (03CR) 10Muehlenhoff: [C: 032] webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [11:37:16] Yes [11:37:21] I've not synced anything [11:37:35] sat.wikipedia.org. 600 IN A 91.198.174.192 [11:37:35] ;; Received 90 bytes from 208.80.154.238#53(ns0.wikimedia.org) in 106 ms [11:37:40] Reedy Urbanecm ^ [11:37:45] Thanks! 
[11:38:02] yw [11:38:03] Oh, I forgot you need to sync everything [11:38:12] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 16, down: 0, shutdown: 0 [11:39:34] (03PS10) 10Muehlenhoff: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [11:39:52] (03PS5) 10Urbanecm: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) [11:40:06] ^^ Reedy please check ^^ [11:40:16] I'm not sure about the InterwikiSortOrder thing [11:40:35] (03PS1) 10Reedy: Add id_internalwikimedia and zhwikiversity to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 [11:40:50] Urbanecm: TBH, never am I when I add it [11:40:57] But you need to add it to alphabetic_svwiktionary too [11:41:18] :D [11:41:19] will do [11:41:33] THe list is much sorted btw [11:41:36] *smaller [11:42:02] and in very strange ordering, sg, sc, st, th, sq, where should I put sat? :D [11:42:02] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [11:42:13] (03CR) 10jenkins-bot: Initial configuration for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445764 (https://phabricator.wikimedia.org/T199577) (owner: 10Urbanecm) [11:42:44] Bleugh [11:42:46] I've no idea [11:42:50] Leave it then? :P [11:42:52] Wikidata can sort it later [11:42:59] Yeah, it is marked as "post install" in the docs [11:43:08] Should I revert the changes in the file I've made? [11:43:30] "If you added a new language code to the langlist (see above), you probably need to add it to the InterwikiSortingOrder.php file too" [11:43:32] Nah, leave them [11:44:03] "Make sure that the language code appears in the file wmf-config/InterwikiSortOrders.php in the operations/mediawiki-config repo. 
(Example: https://gerrit.wikimedia.org/r/359810)" is under Wikidata in Post-install section in https://wikitech.wikimedia.org/wiki/Add_a_wiki [11:44:06] Okay, leaving the patch as it is [11:44:15] (maybe this is the reason why svwiktionary is much smaller than others) [11:45:11] (03PS2) 10Reedy: Add 4 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 [11:45:26] The swedes don't care? ;) [11:45:36] (03PS6) 10Reedy: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:45:42] (03CR) 10Reedy: [C: 032] Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:46:27] I mean, the strange ordering. Nobody understands it so nobody modifies it :P [11:46:57] (03Merged) 10jenkins-bot: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:52:02] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 929.14 seconds [11:52:03] (03PS4) 10Muehlenhoff: Validate SSH keys in account cross check [puppet] - 10https://gerrit.wikimedia.org/r/420810 (https://phabricator.wikimedia.org/T189890) [11:52:46] (03CR) 10Muehlenhoff: [C: 032] Validate SSH keys in account cross check [puppet] - 10https://gerrit.wikimedia.org/r/420810 (https://phabricator.wikimedia.org/T189890) (owner: 10Muehlenhoff) [11:55:19] Urbanecm: If you do that last one, we can then deploy them all in one go [11:55:30] The last one ==? [11:56:33] You mean, the wikimania wiki? 
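The langlist sync above is easy to sanity-check: assuming the file is a flat list of language codes, one per line (as the mediawiki-config langlist is), a newly added code such as `sat` should appear exactly once. A rough verification sketch, with a throwaway file standing in for the real langlist:

```shell
# Confirm a newly added language code appears exactly once in langlist.
# The temp file stands in for the real langlist in mediawiki-config.
langlist=$(mktemp)
printf 'sa\nsat\nsc\n' > "$langlist"

count=$(grep -cx 'sat' "$langlist")
if [ "$count" -eq 1 ]; then
  echo "sat listed once - OK"
else
  echo "sat listed $count times - check langlist"
fi
rm -f "$langlist"
```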
[11:56:34] Reedy, ^^ [11:58:21] (03CR) 10jenkins-bot: Initial configuration for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442871 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [11:59:29] Yeah [12:00:02] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.16 seconds [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1200) [12:00:36] (03PS3) 10Urbanecm: Initial configuration for wikimania2019wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) [12:00:37] done ^^ [12:00:50] (fixing commit message...) [12:00:58] (03PS4) 10Urbanecm: Initial configuration for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) [12:01:01] done finally Reedy ^^ [12:02:10] (03PS5) 10Reedy: Initial configuration for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [12:02:13] (03CR) 10Reedy: [C: 032] Initial configuration for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [12:03:28] (03Merged) 10jenkins-bot: Initial configuration for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [12:04:46] (03PS3) 10Reedy: Add 4 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 [12:04:51] (03CR) 10Reedy: [C: 032] Add 4 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 (owner: 10Reedy) [12:06:06] (03PS2) 10Alexandros Kosiaris: ci: Put Blubber back on Docker integration agents [puppet] - 10https://gerrit.wikimedia.org/r/449804 (owner: 10Dduvall) [12:06:10] (03CR) 
10Alexandros Kosiaris: [V: 032 C: 032] ci: Put Blubber back on Docker integration agents [puppet] - 10https://gerrit.wikimedia.org/r/449804 (owner: 10Dduvall) [12:06:23] (03Merged) 10jenkins-bot: Add 4 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 (owner: 10Reedy) [12:06:25] !log reedy@deploy1001 Synchronized static/images/project-logos/: New logos! (duration: 00m 57s) [12:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:04] woohoo new wiki time [12:07:31] !log reedy@deploy1001 Synchronized langlist: Add sat (duration: 00m 55s) [12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:54] Yeah, even there were issues before SWAT revi :D [12:08:01] yay [12:08:10] * revi prepares for new userpages (I don't use globaluserpage) [12:08:30] What's wrong with globaluserpages, if I may ask? [12:08:46] I just don't like meta userpage displayed over other pages [12:08:49] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: add id_internal (duration: 00m 55s) [12:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] and I don't want to make my meta userpage excessively long by noinclude and stuff [12:09:55] Ok, so nothing technical, at least [12:09:59] yeah [12:10:15] !log reedy@deploy1001 Synchronized dblists/: new wikis (duration: 00m 55s) [12:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:27] !log reedy@deploy1001 Synchronized wmf-config/: New wikis (duration: 00m 55s) [12:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:36] (03PS1) 10Urbanecm: Change id-private to id-internal in MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450008 (https://phabricator.wikimedia.org/T196747) [12:14:53] (03CR) 10jenkins-bot: Initial configuration for wikimaniawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/445765 (https://phabricator.wikimedia.org/T199509) (owner: 10Urbanecm) [12:14:55] (03CR) 10jenkins-bot: Add 4 new wikis to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450006 (owner: 10Reedy) [12:14:56] ^^ Reedy, I'm sorry, but there's a mistake in MWMultiVersion.php :( ^^ [12:15:05] Haha [12:15:06] Go on? [12:15:20] I've uploaded a patch [12:15:23] You should go on and merge it :D [12:15:45] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: new wikis! [12:15:47] (03CR) 10Reedy: [C: 032] Change id-private to id-internal in MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450008 (https://phabricator.wikimedia.org/T196747) (owner: 10Urbanecm) [12:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:48] haha [12:16:15] thanks [12:16:25] hoping not too much bad things were caused by this [12:16:41] * Urbanecm hates when things changes after the initial configuration patch was uploaded [12:16:47] haha [12:16:54] worse thing that happens here is hte wiki won't work [12:17:02] we'll see [12:17:03] (03Merged) 10jenkins-bot: Change id-private to id-internal in MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450008 (https://phabricator.wikimedia.org/T196747) (owner: 10Urbanecm) [12:17:12] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:17:23] Honestly [12:18:05] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, 10WMDE-QWERTY-Sprint-2018-07-31: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) >>! In T199801#4469082, @MoritzMuehlenhoff wrote: > I've upgraded the wikidi... 
[12:18:22] Whoaa, I can see wikimania.wikimedia.org now :D [12:18:32] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: fix id_internal (duration: 00m 55s) [12:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:48] And it's good to see wikimedia.cz (under my control as WMCZ's sysadmin) is working with no changes necessary [12:19:08] !log Wikis created T196748 T198401 T199599 [12:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] T196748: Prepare and check storage layer for id_internalwikimedia - https://phabricator.wikimedia.org/T196748 [12:19:16] T199599: Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 [12:19:16] T198401: Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 [12:19:36] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450009 [12:19:38] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450009 (owner: 10Reedy) [12:20:32] Reedy, can you please create me an account on id_internalwikimedia? I'm asked to import their private wiki, so I need credentials to log in :). [12:20:51] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450009 (owner: 10Reedy) [12:21:41] * Urbanecm should learn how his client works [12:21:48] I don't know how I exited the chan :D [12:21:49] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 25s) [12:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:28] we no longer have wikimania20**wikis? heh [12:22:47] I actually think wikimania.org as a wiki would be better lol [12:22:54] wikimania.wikimedia.org [12:23:10] well...wikimania.org is redirecting to 2018 [12:23:12] should it? 
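(Editorial aside: the deploy above adds the new wikis to both the dblists and `wikiversions.json` before syncing. A minimal consistency check between the two — hypothetical, not the real scap validation — might look like this.)

```python
import json

def unversioned_wikis(all_dblist_text, wikiversions_json_text):
    """Wikis listed in an all.dblist-style file but missing from
    wikiversions.json; such a wiki would have no MediaWiki version
    assigned and would not be servable."""
    wikis = [w.strip() for w in all_dblist_text.splitlines() if w.strip()]
    versions = json.loads(wikiversions_json_text)
    return [w for w in wikis if w not in versions]

dblist = "satwiki\nzhwikiversity\nwikimaniawiki\n"
versions = '{"satwiki": "php-1.32.0-wmf.15", "zhwikiversity": "php-1.32.0-wmf.15"}'
print(unversioned_wikis(dblist, versions))  # ['wikimaniawiki']
```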
[12:23:13] yeah for now [12:23:19] indeed [12:23:27] Ok, just in case [12:23:57] wikimania2018 just ended (I'd say), let's wait for more time before redirecting (or make wikimania.wm.o to wikimania.org) [12:24:15] It'll be wikimania.org to wikimania.wm.o [12:24:36] I actually would like to see the other way around but that's just my pref so [12:24:51] Because we've just created the wiki there and because it is more similar to where wikis are (wikimediafoundation.org just moved to foundation.wikimedia.org) [12:25:17] and wmfwiki was actually on foundation.org for more than a decade iirc [12:25:22] I know [12:25:27] But it moved [12:25:37] yeah unfortunately :-( [12:29:42] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [12:30:44] (03PS2) 10Reedy: Update Foundation urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449909 (https://phabricator.wikimedia.org/T199812) [12:30:49] (03CR) 10Reedy: [C: 032] Update Foundation urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449909 (https://phabricator.wikimedia.org/T199812) (owner: 10Reedy) [12:31:05] (03CR) 10jenkins-bot: Change id-private to id-internal in MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450008 (https://phabricator.wikimedia.org/T196747) (owner: 10Urbanecm) [12:31:07] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450009 (owner: 10Reedy) [12:31:29] Anything else to do with the wikis? 
[12:31:59] Wikidata [12:32:14] (03Merged) 10jenkins-bot: Update Foundation urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449909 (https://phabricator.wikimedia.org/T199812) (owner: 10Reedy) [12:32:17] Usually poke hoo for that [12:32:52] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [12:34:05] !log reedy@deploy1001 Synchronized wmf-config/missing.php: Update foundation urls (duration: 00m 55s) [12:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:54] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: update foundation url (duration: 00m 56s) [12:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:21] Urbanecm: I think just creating you an account left [12:44:15] Good. Waiting for credentials, then I'll do the import thing [12:44:51] Number of attached accounts: 872 [12:44:52] Lots of wikis [12:45:20] For me, only 744. [12:45:29] !log depooling elastic1030 (master struggling) [12:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:32] slacker ;) [12:48:11] (03CR) 10jenkins-bot: Update Foundation urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449909 (https://phabricator.wikimedia.org/T199812) (owner: 10Reedy) [12:48:14] !log Sanitize wikimaniawiki - T201001 [12:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:19] T201001: Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 [12:48:29] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10akosiaris) I took a shot at that. ``` ~ # lspci |grep net 3b:00.0 Ethernet controller: QLogic Corp. Device 8070 (rev 02) 3b:00.1 Ethernet controller: QLogic Corp. Device 8070 (rev 02)... 
[12:50:09] (03CR) 10Gehel: Add common base utility modules (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:51:32] !log banning elastic1030 (master struggling) [12:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:02] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2018), 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10akosiaris) [12:52:17] (03PS2) 10Alexandros Kosiaris: phabricator: Use the mysql native driver [puppet] - 10https://gerrit.wikimedia.org/r/443045 [12:52:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] phabricator: Use the mysql native driver [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris) [12:53:19] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1092 fully after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449943 (owner: 10Jcrespo) [12:54:00] (03PS2) 10Elukey: Import upstream version 2.2.3 [debs/archiva] - 10https://gerrit.wikimedia.org/r/449755 (https://phabricator.wikimedia.org/T192639) [12:54:31] Uh [12:54:33] Wtf is going on there [12:54:45] why is the wikimania wiki showing as wikimania2019.wikimedia.org in CA? [12:55:02] (03Merged) 10jenkins-bot: mariadb: Repool db1092 fully after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449943 (owner: 10Jcrespo) [12:55:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Migrate to apps/v1 API [deployment-charts] - 10https://gerrit.wikimedia.org/r/449458 (owner: 10Alexandros Kosiaris) [12:56:30] (03CR) 10Gergő Tisza: "> In order to simplify, let's just allow bureaucrats to add and remove this permission locally maybe? 
The 'danger' is who gets access to i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 (owner: 10Gergő Tisza) [12:57:21] !log reboot lvs3004 to SSBD-enabled microcode/kernel [12:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:32] Urbanecm: lol [12:57:33] reedy@deploy1001:/srv/mediawiki-staging$ grep 2019 wmf-config/InitialiseSettings.php [12:57:33] 'wikimaniawiki' => '//wikimania2019.wikimedia.org', [12:57:33] 'wikimaniawiki' => 'https://wikimania2019.wikimedia.org', [12:57:46] (03Abandoned) 10Alexandros Kosiaris: Phabricator: Use mysqlnd [puppet] - 10https://gerrit.wikimedia.org/r/442829 (owner: 1020after4) [12:57:47] That seems like a mistake... [12:57:54] * Urbanecm is going to upload a patch [12:57:55] fixing [12:58:00] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) > so the producer can simply send plain messages and they would be compressed on the f... 
[12:58:15] (03PS1) 10Reedy: Remove 2019 from wikimaniawiki url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450016 [12:58:20] Thanks [12:58:26] (03CR) 10Reedy: [C: 032] Remove 2019 from wikimaniawiki url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450016 (owner: 10Reedy) [12:58:48] !log Sanitize zhwikiversity satwiki T199599 T198401 [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] T199599: Prepare and check storage layer for zhwikiversity - https://phabricator.wikimedia.org/T199599 [12:58:54] T198401: Prepare and check storage layer for satwiki - https://phabricator.wikimedia.org/T198401 [12:59:43] (03Merged) 10jenkins-bot: Remove 2019 from wikimaniawiki url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450016 (owner: 10Reedy) [13:00:20] jynus: I just pulled your change onto deploy1001 [13:01:33] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fix wikimaniawiki wgserver and wgcanonicalserver (duration: 00m 56s) [13:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:23] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450018 [13:02:25] (03PS1) 10Muehlenhoff: Enable microcode updates for authdns servers [puppet] - 10https://gerrit.wikimedia.org/r/450017 [13:02:27] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450018 (owner: 10Reedy) [13:03:02] (03Abandoned) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450018 (owner: 10Reedy) [13:05:15] (03CR) 10jenkins-bot: mariadb: Repool db1092 fully after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449943 (owner: 10Jcrespo) [13:05:17] (03CR) 10jenkins-bot: Remove 2019 from wikimaniawiki url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450016 (owner: 10Reedy) [13:09:39] marostegui, if I should add #data-services instead of 
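(Editorial aside: the bug fixed above — `wikimaniawiki` still pointing at `wikimania2019.wikimedia.org` — was found with a plain `grep 2019` on `InitialiseSettings.php`. A slightly more targeted check for year-suffixed domains assigned to the year-less wiki could be sketched as follows; the pattern is illustrative, not part of the actual config lint.)

```python
import re

def stale_year_urls(settings_text, wiki="wikimaniawiki"):
    """Find wgServer-style entries for a year-less wiki that still point
    at a year-suffixed domain (e.g. wikimania2019.wikimedia.org)."""
    pattern = rf"'{wiki}'\s*=>\s*'(?:https:)?//(\w*\d{{4}})\.wikimedia\.org'"
    return re.findall(pattern, settings_text)

excerpt = (
    "'wikimaniawiki' => '//wikimania2019.wikimedia.org',\n"
    "'wikimaniawiki' => 'https://wikimania2019.wikimedia.org',\n"
)
print(stale_year_urls(excerpt))  # ['wikimania2019', 'wikimania2019']
```

Both the protocol-relative `$wgServer` form and the absolute `$wgCanonicalServer` form are caught, matching the two lines Reedy's grep turned up.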
#cloud-services, lemme know. I'm just following https://wikitech.wikimedia.org/wiki/Add_a_wiki [13:09:53] (i mean, to tasks like T201001) [13:09:54] T201001: Prepare and check storage layer for wikimaniawiki - https://phabricator.wikimedia.org/T201001 [13:10:45] (03PS1) 10Vgutierrez: varnish: get rid of AES128-SHA redirection to /sec-warning [puppet] - 10https://gerrit.wikimedia.org/r/450020 (https://phabricator.wikimedia.org/T192555) [13:12:12] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type=run_podsandbox https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:12:22] Urbanecm: I think that is what they do lately, but probably bd808 knows best [13:12:57] !log bounce traffic from lvs3002 to lvs3004 (SSBD-enabled) [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:22] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:14:32] PROBLEM - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [13:14:52] PROBLEM - PyBal connections to etcd on lvs3002 is CRITICAL: CRITICAL: 0 connections established with conf1003.eqiad.wmnet:2379 (min=12) [13:14:59] lvs3002 is me ^ [13:15:04] arg [13:15:10] you almost killed me :P [13:15:12] PROBLEM - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [13:15:15] sorry! 
[13:16:03] ACKNOWLEDGEMENT - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 Ema traffic temporarily switched to lvs3004 [13:16:03] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs3002 is CRITICAL: CRITICAL: 0 connections established with conf1003.eqiad.wmnet:2379 (min=12) Ema traffic temporarily switched to lvs3004 [13:16:03] ACKNOWLEDGEMENT - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Ema traffic temporarily switched to lvs3004 [13:16:34] Reedy: sorry, I got distracted [13:16:43] I can deploy now mine? [13:16:45] Sure [13:16:46] I'm done [13:16:52] thanks, Reedy [13:16:54] Just wanted to let you know it was there etc [13:17:02] you did well, I get distracted [13:17:04] thank you [13:17:22] and I would be confused if maybe I had deployed it already [13:17:48] most our changes are very compatible forward and backward, but we definitely do not want it hanging [13:18:57] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 fully (duration: 00m 56s) [13:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:22] Reedy: Urbanecm: not sure if something that interest you, but we will move several large s3 wikis to s5 at some point [13:19:40] probably about due :) [13:19:59] AND I proposed to have a very low resource section with just enough HA "s0" for closed wikis and other weird stuff [13:20:10] (regarding storage) [13:20:29] things that are not normally read, and just kept for consistency cross-sections [13:20:36] Thank you for the info, jynus [13:21:26] T184805 [13:21:26] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [13:21:44] enwikivoyage, cebwiki, shwiki, srwiki & mgwiktionary, in addition to labswiki [13:25:27] (03PS1) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450023 [13:25:41] arbitary sizes are 
arbitary [13:27:31] (03PS2) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450023 [13:27:46] (03CR) 10Reedy: [C: 032] Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450023 (owner: 10Reedy) [13:28:56] (03Merged) 10jenkins-bot: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450023 (owner: 10Reedy) [13:30:05] !log reedy@deploy1001 Synchronized dblists/: Update size dblists (duration: 00m 55s) [13:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:44] (03PS4) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [13:31:01] (03CR) 10Ema: [C: 031] Enable microcode updates for authdns servers [puppet] - 10https://gerrit.wikimedia.org/r/450017 (owner: 10Muehlenhoff) [13:33:16] 10Operations, 10Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (10herron) fully support this In addition to checking file extensions we could also check for presence of a shebang `#!` on the first line of `.erb` files. [13:34:25] 10Operations, 10Scap (Scap3-MediaWiki-MVP): Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899 (10Joe) We ended up generating the dsh lists in production from etcd, which is ok as a solution without asking scap to know about its details. I think we can close this ticket. 
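(Editorial aside: the "Update size dblists" change above sorts wikis into size buckets — and, as Reedy notes, the sizes are arbitrary. The thresholds in this sketch are invented for illustration and are not the values mediawiki-config actually uses.)

```python
def size_bucket(article_count, small_max=10_000, medium_max=1_000_000):
    """Assign a wiki to a size dblist. The cutoffs here are made up,
    echoing the 'arbitary sizes are arbitary' remark in the log."""
    if article_count < small_max:
        return "small"
    if article_count < medium_max:
        return "medium"
    return "large"

for wiki, pages in [("satwiki", 12), ("cebwiki", 5_300_000)]:
    print(wiki, size_bucket(pages))
```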
[13:34:31] 10Operations, 10Scap (Scap3-MediaWiki-MVP): Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899 (10Joe) 05Open>03Resolved [13:34:33] 10Operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395 (10Joe) [13:37:50] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler03/11972/" [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:37:56] (03PS1) 10Fdans: Remove all geowiki references from puppet [puppet] - 10https://gerrit.wikimedia.org/r/450025 (https://phabricator.wikimedia.org/T190059) [13:38:17] !log reboot lvs3002 to SSBD-enabled microcode/kernel [13:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:31] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,pull_image,run_podsandbox,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:40:11] (03PS5) 10Jcrespo: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:41:14] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10akosiaris) I did some manual testing btw, I am guessing this is the error ``` servicechecker.CheckError: Generic connection error: HTTPConnectionPool(host=... 
[13:43:41] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikimedia.conf, wikimania.conf [puppet] - 10https://gerrit.wikimedia.org/r/449722 (https://phabricator.wikimedia.org/T196968) [13:44:13] (03PS1) 10Filippo Giunchedi: logstash: remove multiline filter [puppet] - 10https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) [13:44:15] (03PS1) 10Filippo Giunchedi: logstash: use default number of queue workers [puppet] - 10https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) [13:44:17] (03PS1) 10Filippo Giunchedi: logstash: default to 4MB receive buffer [puppet] - 10https://gerrit.wikimedia.org/r/450028 (https://phabricator.wikimedia.org/T200960) [13:44:43] (03CR) 10jerkins-bot: [V: 04-1] logstash: remove multiline filter [puppet] - 10https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:44:59] (03CR) 10jerkins-bot: [V: 04-1] logstash: use default number of queue workers [puppet] - 10https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:45:25] (03CR) 10jerkins-bot: [V: 04-1] logstash: default to 4MB receive buffer [puppet] - 10https://gerrit.wikimedia.org/r/450028 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [13:45:34] (03PS2) 10Filippo Giunchedi: logstash: remove multiline filter [puppet] - 10https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) [13:45:36] !log bounce traffic back from lvs3004 to lvs3002 [13:45:36] (03PS2) 10Filippo Giunchedi: logstash: use default number of queue workers [puppet] - 10https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:39] (03PS2) 10Filippo Giunchedi: logstash: default to 4MB receive buffer [puppet] - 10https://gerrit.wikimedia.org/r/450028 
(https://phabricator.wikimedia.org/T200960) [13:45:40] (03CR) 10Muehlenhoff: [C: 032] Enable microcode updates for authdns servers [puppet] - 10https://gerrit.wikimedia.org/r/450017 (owner: 10Muehlenhoff) [13:45:51] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:45:52] RECOVERY - pybal on lvs3002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:46:27] (03PS2) 10BBlack: use one-packet-scheduler for most logstash UDPs [puppet] - 10https://gerrit.wikimedia.org/r/449913 (https://phabricator.wikimedia.org/T200960) [13:46:42] RECOVERY - PyBal backends health check on lvs3002 is OK: PYBAL OK - All pools are healthy [13:47:53] (03PS1) 10BBlack: cp1075-99: add to hieradata, conftool-data, acls [puppet] - 10https://gerrit.wikimedia.org/r/450029 (https://phabricator.wikimedia.org/T195923) [13:52:51] (03CR) 10jenkins-bot: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450023 (owner: 10Reedy) [13:55:30] (03PS6) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [13:57:26] hello... [13:57:43] just curious if the ops-private@ email address is useful or monitored? [13:58:42] debt: both apparently :) [13:59:04] well, I've sent a couple emails there and haven't gotten any responses...so...yeah [13:59:15] I dunno if it's open send [13:59:28] it's not [13:59:33] I didn't get a bounce back, so wasn't sure [13:59:47] it's a group-internal list, it drops rather than bounces probably [13:59:51] it's private in both directions :) [13:59:59] ahh, bblack [14:00:01] thanks [14:00:14] is there a better email address to use? 
[14:00:34] It seems email software failed [14:00:36] I think [14:00:37] I need to check on the API QPS increases [14:00:53] oh no, I'm wrong, your email did go through, gmail just failed at searching [14:00:54] Yeah, they got it.. And alex replied. But it didn't include your email in the to field :P [14:00:57] ahh [14:01:01] dang [14:01:07] that is an email fail [14:01:23] can someone forward to me, please? [14:01:34] but either way, I think the more-appropriate venue is usually ops@ [14:01:43] (which is more-public and others can subscribe to, etc) [14:02:00] ok, good to know, bblack - thanks. I wasn't sure how public this request needed to be (QPS limits) [14:02:08] yeah that makes it tricky [14:02:42] suggestions? [14:02:58] security usually reaches to only NDA people [14:03:07] I understand this is not really security related [14:03:28] right, not totally security related. it could be, maybe, if the limits are breached. [14:03:37] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10akosiaris) And it fails because it tries to connect to http://{{ template "wmf.releasename" . }}:{{ .Values.main_app.port }} per ``` {{- define "wmf.ap... [14:03:43] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:03:46] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) There's no simple way to start the stretch installer with a more recent kernel. Some options were discussed in this recent talk at DebConf: https://meetings-archive.d... 
[14:04:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:04:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:04:32] debt: you should have mail [14:05:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:05:14] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:05:30] Reedy: yup, akosiaris sent it to me directly. :) yay! [14:06:14] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:07:19] Reedy, akosiaris, bblack - I've responded to the limit increase and I'll use the ops-private@ when they let me know of the next increase. It should happen every two-ish weeks. [14:07:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:07:31] thanks for your help! 
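(Editorial aside: the 5xx alerts above fire on the fraction of recent Graphite datapoints exceeding a threshold — e.g. "22.22% of data above the critical threshold [1000.0]". A simplified version of that calculation, not the real check_graphite plugin, looks like this.)

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above a threshold,
    roughly how the check_graphite-style alerts in the log are phrased."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

# Nine one-minute samples of 5xx reqs/min, two of them over the limit
series = [200, 300, 1500, 900, 1800, 400, 250, 100, 600]
print(f"{percent_above(series, 1000.0):.2f}% above 1000.0")  # 22.22% above 1000.0
```

Null datapoints are skipped rather than counted, which is why short spikes can clear quickly once fresh samples arrive, as in the RECOVERY messages a few minutes later.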
[14:08:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:10:04] (03CR) 10BBlack: [C: 032] cp1075-99: add to hieradata, conftool-data, acls [puppet] - 10https://gerrit.wikimedia.org/r/450029 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [14:10:42] uh we had a brief 503 spike [14:10:48] yeah I was about to say that [14:10:53] from 13:59 to 14:02 [14:10:55] I'm assuming the k8s thing is related? [14:11:00] no it is not [14:11:21] hmmm [14:11:25] mostly text, but upload was affected too [14:11:45] text on all dcs [14:11:57] I was looking at it [14:12:06] to see which layer was it [14:12:14] upload I see like 0.225 (same for misc) but text peaks at 18 rps [14:12:25] so this is almost exclusively text [14:12:46] I agree with that [14:13:44] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10RStallman-legalteam) NDAs for Lea Voget and Birgit Müller are signed and on file. 
[14:13:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:13:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:14:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:15:44] losts of errors on db1104 db1119, that normally means api overload [14:16:08] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikimedia.conf, wikimania.conf [puppet] - 10https://gerrit.wikimedia.org/r/449722 (https://phabricator.wikimedia.org/T196968) [14:16:44] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1533218291642&to=1533218558734 [14:17:00] I don't see any obvious traffic-level pattern to those 5xx yet [14:17:06] (e.g. 
ips or URIs, etc) [14:17:15] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikimedia.conf, wikimania.conf [puppet] - 10https://gerrit.wikimedia.org/r/449722 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:17:27] mediawiki is much more sensitive than traffic, it needs way less hits [14:18:23] what I mean is, within those ~18 rps of 5xx, the patterns look relatively-random (not all clustered up on a certain pathname or hostname or API call or all from one client IP, etc) [14:18:33] cr1-eqiad has this in the logs: [14:18:36] BGP Session Flap: 208.80.154.201 (AS64700) [14:18:47] <_joe_> I'm merging a delicate apache change [14:18:50] <_joe_> FYI [14:19:06] <_joe_> it should be a noop and has been tested, but one never knows 100% [14:21:30] 10Operations, 10vm-requests: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (10elukey) [14:21:54] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [14:22:25] interesting [14:22:35] <_joe_> jynus: I guess that's not expected [14:23:07] Why OTRS said (The MariaDB server is running with the --read-only option so it cannot execute this statement, SQL: 'UPDATE ticket SET queue_id = ?, change_time = '2018-08-02 14:21:43' , change_by = ...)? [14:23:14] Can't login to ticket.wikimedia.org (related to ^?) [14:23:16] that should have put gerrit in and otrs in read only [14:23:33] there is an ongoing outage, trying to see why [14:23:41] (03PS7) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [14:23:42] I am here [14:23:49] kk great it's known [14:23:57] (Y) [14:24:21] <_joe_> I'm here too, if I'm needed [14:24:29] I see db1065 up [14:24:45] are we maybe having network hiccups? 
[14:25:15] I could fix it, but I don't want to lose any data, I want to be sure everything is ok [14:25:17] <_joe_> jynus: what's the problem? [14:25:25] the proxy detected db1065 as down [14:25:27] db1065 is up and mysql wasn't restarted, the mysql uptime is quite high [14:25:30] and it failed over [14:25:48] <_joe_> jynus: in read-only mode I guess? [14:25:53] yes [14:26:12] yes, confirmed db1117:3322 is read_only=1 [14:26:16] (03CR) 10Gehel: Add common base utility modules (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:26:20] (03CR) 10Marostegui: "test" [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [14:26:34] mmm why can I write in gerrit ^ ? [14:26:48] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10Patch-For-Review, and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Nikerabbit) [14:27:03] I see no useful logs [14:27:18] and the other proxy didn't complain [14:27:22] I say we reload the proxy [14:27:23] !log restarting elastic1030 to trigger a master election [14:27:24] interesting dbproxy1007 sees db1065 as up [14:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:32] jynus: I will do that [14:27:41] maybe the network error happened on the proxy only [14:28:00] dbproxy1002 reloaded [14:28:24] <_joe_> you have no log directive for haproxy on dbproxy1002, it's not easy to figure out what happened [14:28:44] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [14:29:16] (03PS3) 10Andrew Bogott: labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) [14:29:45] I swear it used to have one 
[14:30:05] I saw the logs registering in the past X detected as down after Y attempts [14:30:56] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10Patch-For-Review, and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Nikerabbit) While testing this index patch, I not... [14:31:51] (03PS1) 10Urbanecm: Remove noratelimit from epcoordinator group on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450035 (https://phabricator.wikimedia.org/T201010) [14:32:10] <_joe_> jynus: maybe there is and I'm just not finding it [14:32:13] revi: Alaa is everything ok now? [14:32:19] lemme see [14:32:28] Yes [14:32:41] yup, LGTM [14:32:44] _joe_: so if we don't, that is something to fix [14:32:46] thank you! [14:32:58] Thanks all [14:32:59] but I am mostly worried about random connection errors happening today [14:33:02] and write works well [14:33:29] revi: the explanation is on error, we go read only- and then check everything is ok before reenabling writes [14:33:37] sorry for the disruption [14:34:23] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:41:47] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10akosiaris) >>! In T196477#4472436, @MoritzMuehlenhoff wrote: > There's no simple way to start the stretch installer with a more recent kernel. Some options were discussed in this recent... [14:46:22] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) @akosiaris @MoritzMuehlenhoff yes we do have 2x1GB NIC' on the server. since the server is in a rack with 10G switch, we can use a 1000base-T-SEP copper adapter to connect one... 
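The failure mode debugged above — the proxy marking db1065 down after repeated failed checks, failing over to a read-only spare, and haproxy logging nothing about why — corresponds to a haproxy configuration shaped roughly like this. This is a sketch only: server names, ports, and check thresholds are illustrative, not the actual dbproxy1002 config.

```
global
    # The piece _joe_ notes is missing: without a log directive,
    # haproxy records nothing about why a backend was marked down.
    log /dev/log local0

defaults
    log    global
    mode   tcp
    option tcplog

listen mariadb
    bind :3306
    # "X detected as down after Y attempts": fall 3 marks the server
    # down after three consecutive failed checks, rise 2 brings it back.
    server db1065 db1065.eqiad.wmnet:3306 check inter 3s fall 3 rise 2
    # Failover target, kept read_only=1 until a human confirms no data
    # was lost -- which is why OTRS saw --read-only errors meanwhile.
    server db1117 db1117.eqiad.wmnet:3322 check backup
```

With `log global` in place, a later flap would leave `Server mariadb/db1065 is DOWN, reason: ...` lines in syslog — the kind of logging T201021 (filed later in this log) asks for.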
[14:49:16] (03PS1) 10Fdans: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) [14:49:48] (03CR) 10jerkins-bot: [V: 04-1] Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [14:49:53] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) >>! In T196477#4472512, @akosiaris wrote: >>>! In T196477#4472436, @MoritzMuehlenhoff wrote: >> There's no simple way to start the stretch installer with a more recen... [14:50:40] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10thcipriani) Ah ha! Thanks for the explanation. That makes sense since minikube uses kube-dns out of the box. Are we waiting for CoreDNS or something else? [14:51:50] (03PS8) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [14:52:31] (03PS2) 10Fdans: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) [14:52:48] <_joe_> please let's not merge any change for now [14:53:46] _joe_: yessir [14:55:34] (03CR) 10Marostegui: "Works: https://puppet-compiler.wmflabs.org/compiler02/11975/" [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [14:58:21] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10akosiaris) >>! In T196477#4472530, @MoritzMuehlenhoff wrote: >>>! In T196477#4472512, @akosiaris wrote: >>>>! 
In T196477#4472436, @MoritzMuehlenhoff wrote: >>> There's no simple way to... [14:58:26] marostegui you can write because gerrit partly does not use a db anymore. [14:58:37] it uses notedb for changes / accounts. [14:59:30] marostegui: paladox sorry, I was going to mention that [14:59:37] but was busy at the time [14:59:41] ok [14:59:49] thanks guys :) [15:01:16] the only thing the read only mode would affect is groups (and anything non changes wise i think and accounts). [15:09:57] (03PS1) 10Andrew Bogott: wmcs eqiad1: set up glance syncing within the eqiad1 glance hosts [puppet] - 10https://gerrit.wikimedia.org/r/450041 [15:10:33] (03CR) 10jerkins-bot: [V: 04-1] wmcs eqiad1: set up glance syncing within the eqiad1 glance hosts [puppet] - 10https://gerrit.wikimedia.org/r/450041 (owner: 10Andrew Bogott) [15:12:36] (03PS2) 10Andrew Bogott: wmcs eqiad1: set up glance syncing within the eqiad1 glance hosts [puppet] - 10https://gerrit.wikimedia.org/r/450041 [15:16:29] (03CR) 10Andrew Bogott: [C: 032] wmcs eqiad1: set up glance syncing within the eqiad1 glance hosts [puppet] - 10https://gerrit.wikimedia.org/r/450041 (owner: 10Andrew Bogott) [15:17:30] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10mobrovac) >>! In T200215#4472284, @Ottomata wrote: > Eric can correct me if I'm wrong, but I bel... [15:17:34] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) >>! In T196477#4472573, @akosiaris wrote: > Unless we also upgrade to a kernel from say `stretch-backports` (4.16+94~bpo9+1 from what I see currently), yes it does.... [15:19:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10ssastry) >>! 
In T191362#4470247, @RobH wrote: > @ssastry: I'm assigning this to you for feedback, please confirm this host is no longer used and can be decommissioned. (Then assign... [15:23:42] 10Operations, 10LDAP, 10Patch-For-Review: add ssh key comparison to cross-validate-accounts.py - https://phabricator.wikimedia.org/T189890 (10MoritzMuehlenhoff) 05Open>03Resolved This has now been added to the daily account consistency check. [15:23:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10ssastry) >>! In T191362#4472636, @ssastry wrote: >>>! In T191362#4470247, @RobH wrote: >> @ssastry: I'm assigning this to you for feedback, please confirm this host is no longer used... [15:26:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10RobH) So promethium/WMF3571 was purchased in January of 2013. It is very old, and very out of warranty. If this host is going to continue to be used for work, we should look at rep... [15:29:37] 10Operations, 10ops-codfw, 10DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Papaul) a:05Papaul>03Marostegui @Marostegui disk replacement complete [15:32:04] 10Operations: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (10aaron) [15:32:06] 10Operations, 10ops-codfw, 10DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (10Marostegui) Thanks! ``` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding) ``` [15:32:07] (03CR) 10Vgutierrez: "partial response.. 
gerrit interface is going crazy :)" (0317 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [15:32:37] (03PS15) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:32:50] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) @robh we do have a 12 disks decom on site. (db2013) [15:34:03] (03CR) 10jerkins-bot: [V: 04-1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [15:37:38] (03PS7) 10Jcrespo: Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/449742 (https://phabricator.wikimedia.org/T156924) [15:37:40] (03PS1) 10Jcrespo: Upgrade check_mariadb.py to the latest WMFMariaDB version [puppet] - 10https://gerrit.wikimedia.org/r/450046 [15:38:17] (03CR) 10jerkins-bot: [V: 04-1] Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/449742 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [15:38:34] (03CR) 10jerkins-bot: [V: 04-1] Upgrade check_mariadb.py to the latest WMFMariaDB version [puppet] - 10https://gerrit.wikimedia.org/r/450046 (owner: 10Jcrespo) [15:50:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10dduvall) >>! In T199489#4472369, @akosiaris wrote: > I did some manual testing btw, I am guessing this is the error > > ``` > servicechecker.CheckError: Ge... 
[15:50:40] (03PS4) 10Addshore: Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [15:51:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10ssastry) >>! In T191362#4472656, @RobH wrote: > Is it projected to need this system for another year? Yes, at least. I am happy to explore getting a true labs VM for this. Will chat... [15:53:06] !log milimetric@deploy1001 Started deploy [analytics/refinery@5e5b5a9]: Quick fix for sqoop script [15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:46] (03CR) 10Jcrespo: "Marostegui- this could be the foundations for the read only check, try it on some hosts and see what you think." [puppet] - 10https://gerrit.wikimedia.org/r/450046 (owner: 10Jcrespo) [15:57:59] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (10Papaul) [15:59:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:00:04] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:00:04] godog, moritzm, and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:40] (03PS2) 10Jcrespo: Upgrade check_mariadb.py to the latest WMFMariaDB version [puppet] - 10https://gerrit.wikimedia.org/r/450046 [16:00:58] !log milimetric@deploy1001 Finished deploy [analytics/refinery@5e5b5a9]: Quick fix for sqoop script (duration: 07m 52s) [16:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:04:43] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:07:31] (03PS1) 10Jforrester: Follow-up 4c97a86fe8: Add wikimania.wikimedia.org to CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450051 [16:08:15] Reedy: Tsk, you forgot ^^^ [16:08:57] (And later we need to import each of the closed Wikimanias into a bespoke namespace for each, but that can wait.) [16:09:09] yeah, later ;P [16:11:57] (03PS5) 10Volans: Add common base utility modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) [16:12:07] (03CR) 10Volans: "Thanks for the replies, see inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:12:45] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) Related link (co-indidentally from today!) wrt steps needed in d-i to support installing from backports: https://lists.debian.org/debian-boot/2018/08/msg00015.html [16:14:57] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10akosiaris) >>! 
In T200690#4468603, @Tgr wrote: > So yeah, apparently SSH uses a non-login shell when you give it a command to execute, and there is no easy way around it; Correc... [16:15:53] RECOVERY - Device not healthy -SMART- on db2061 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops [16:17:05] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) 05Open>03stalled Stalling this, might happen again and upstream likely will have mitigations in linux 4.19 [16:19:36] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10jcrespo) [16:22:48] (03CR) 10Reedy: [C: 031] Follow-up 4c97a86fe8: Add wikimania.wikimedia.org to CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450051 (owner: 10Jforrester) [16:26:13] (03CR) 10Herron: [C: 031] logstash: default to 4MB receive buffer [puppet] - 10https://gerrit.wikimedia.org/r/450028 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [16:29:05] (03PS1) 10Cmjohnson: Adding mgmt dns for 2 spare servers [dns] - 10https://gerrit.wikimedia.org/r/450057 (https://phabricator.wikimedia.org/T196697) [16:29:14] (03CR) 10Herron: [C: 031] logstash: remove multiline filter [puppet] - 10https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [16:29:47] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Tgr) I can set umask in my local bashrc. It just seems nice to prevent wasting deployer time in the future whenever someone connects in a nonstandard way (as a wrong umask only ge... 
[16:32:45] (03PS3) 10Herron: logstash: use default number of queue workers [puppet] - 10https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [16:33:09] (03CR) 10Herron: [C: 031] "LGTM -- just added a note in the commit msg about what the default value is" [puppet] - 10https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) (owner: 10Filippo Giunchedi) [16:37:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) Ok I think I have finally get something :) So I left tcpdump to capture ipv6 traffic excluding some "known" IPs like puppetmas... [16:38:41] (03CR) 10Dzahn: "did you also run the manual commands to re-generate zones? apparently you did :)" [dns] - 10https://gerrit.wikimedia.org/r/442867 (https://phabricator.wikimedia.org/T198400) (owner: 10Urbanecm) [16:39:37] (03CR) 10Gehel: [C: 031] "All good!" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:39:39] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10akosiaris) >>! In T200690#4472902, @Tgr wrote: > I can set umask in my local bashrc. Yes, that would work too. > It just seems nice to prevent wasting deployer time in the futu... 
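The umask point in T200690 above is easy to demonstrate in isolation: umask is per-process state, inherited by children but never propagated back, so a value set by a login profile is simply absent in the non-login shell that `ssh host command` spawns. A minimal sketch (the octal values are examples, not the deploy-host defaults):

```shell
# umask is process-local: each shell starts from whatever its parent
# had, and changes made in a child never reach the parent.
umask 0022
parent=$(umask)

# Command substitution runs in a subshell, so this override is local --
# just as `ssh host cmd` gets its own (non-login) shell's umask.
child=$(umask 0002; umask)

echo "$parent $child"   # the parent's value is unchanged
```

Setting the value in `~/.bashrc`, as suggested above, works because Debian builds bash to source it for shells spawned by sshd, while `~/.profile` is read only by login shells.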
[16:42:26] 10Operations, 10WMF-Communications, 10Wikimedia-Apache-configuration, 10wikimediafoundation.org: Update redirect for jobs.wikimedia.org - https://phabricator.wikimedia.org/T200951 (10Aklapper) [16:50:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) Tried to find all the occurrences of webproxy and added the related https configuration, let's see if things will change! [16:53:35] i noticed an issue on officewiki where if you try to access a page as a non-logged in user, eg https://office.wikimedia.org/wiki/Learning_and_Development/Leadership_Framework, you get redirected to the login screen per usual. then once you login, you get redirected to the homepage rather than to the original page you were trying to access. note this only appears to be an issue on desktop - mobile seems to handle the redirect after login [16:53:35] correctly. [16:54:05] im happy to log this in phabricator but wasnt able to quickly figure out where the appropriate place is [16:54:22] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10Papaul) Port information labtestnet2002 rack B1 ge-1/0/16 labtestnet2003 rack B1 ge1/0/17 ''' [edit interfaces interface-range cloud-instance-po... [16:56:04] RECOVERY - configured eth on labtestnet2002 is OK: OK - interfaces up [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1700). 
[17:02:24] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10Papaul) a:05Papaul>03Andrew @Andrew both ports are up and in the cloud-instance-ports interfaces ranges. Please check if everything looks good, y... [17:03:01] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Tgr) Scap is not the problem, git commands are. Although it seems fair to assume that anyone who does git writes on the deploy host will also run scap at some point... [17:03:07] (03PS1) 10Gehel: elasticsearch: migrate relforge to stretch [puppet] - 10https://gerrit.wikimedia.org/r/450060 (https://phabricator.wikimedia.org/T193649) [17:05:55] (03PS1) 10Dzahn: posgresql::backup: fix "Unterminated quoted string" in cron tab [puppet] - 10https://gerrit.wikimedia.org/r/450061 (https://phabricator.wikimedia.org/T190184) [17:10:11] (03PS1) 10Gehel: elasticsearch: migrate codfw cluster to Stretch and RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/450062 (https://phabricator.wikimedia.org/T193649) [17:10:25] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) [17:10:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:14:09] (03PS1) 10Gehel: elasticsearch: migrate eqiad cluster to Stretch and RAID0 [puppet] - 10https://gerrit.wikimedia.org/r/450064 (https://phabricator.wikimedia.org/T193649) [17:14:23] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:22:17] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1076.eqiad.wmnet'] ``` The log can be found in `/var/log/w... [17:27:49] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Lost access to archiva - https://phabricator.wikimedia.org/T200954 (10Smalyshev) Thanks, I can log in now! [17:29:55] (03PS1) 10BBlack: cp1076: add macaddr [puppet] - 10https://gerrit.wikimedia.org/r/450065 [17:30:13] (03CR) 10BBlack: [V: 032 C: 032] cp1076: add macaddr [puppet] - 10https://gerrit.wikimedia.org/r/450065 (owner: 10BBlack) [17:38:14] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:38:42] (03PS5) 10Bstorm: WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [17:39:46] (03CR) 10jerkins-bot: [V: 04-1] WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 (owner: 10Bstorm) [17:43:09] (03PS6) 10Bstorm: WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [17:43:53] (03CR) 10jerkins-bot: [V: 04-1] WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 (owner: 10Bstorm) [17:45:04] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:38] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) I mentioned this to Marko in IRC, but I'm not sure if his previous statement is quite... 
[18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1800). [18:00:04] stephanebisson and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] خ/ [18:00:13] o/ [18:00:17] mine is not testable [18:01:55] hello [18:02:18] mine is also not really testable [18:02:37] I guess I can SWAT today [18:02:46] SWAT away! [18:04:05] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449756 (owner: 10Sbisson) [18:06:13] https://integration.wikimedia.org/zuul/ it's not triggered [18:06:15] let's rebase [18:06:23] (03PS2) 10Ladsgroup: Fix a typo in ORES models config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449756 (owner: 10Sbisson) [18:06:35] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Bouncing the network ports of elastic1049 and elastic1038, solved the issue. 
[18:06:39] (03CR) 10Ladsgroup: [C: 032] Fix a typo in ORES models config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449756 (owner: 10Sbisson) [18:07:01] nice find [18:07:27] now it's there [18:08:13] (03Merged) 10jenkins-bot: Fix a typo in ORES models config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449756 (owner: 10Sbisson) [18:10:44] (03PS2) 10Dzahn: posgresql::backup: fix "Unterminated quoted string" in cron tab [puppet] - 10https://gerrit.wikimedia.org/r/450061 (https://phabricator.wikimedia.org/T190184) [18:10:47] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:449756|Fix a typo in ORES models config]] (duration: 00m 57s) [18:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:12] stephanebisson: ^ It's live [18:11:34] Amir1: Thanks. [18:14:10] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1076.eqiad.wmnet'] ``` and were **ALL** successful. [18:15:14] !log un-banning and repooling elastic1030 - T201039 [18:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:18] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [18:16:35] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10mobrovac) Indeed, the discussion is probably out of the scope of this ticket. That said, it wou... 
[18:18:05] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) {meme, src=votecat} [18:20:07] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.15/extensions/ORES/maintenance/PurgeScoreCache.php: SWAT: [[gerrit:450066|Join decomposition on maintenance/PurgeScoreCache.php (T200680)]] (duration: 00m 57s) [18:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:11] T200680: Run PurgeScoreCache.php on all wikis that have ORES enabled - https://phabricator.wikimedia.org/T200680 [18:20:17] PROBLEM - Disk space on cp1075 is CRITICAL: DISK CRITICAL - free space: /srv/nvme0n1p1 324 MB (0% inode=99%) [18:20:31] !log Morning SWAT is done [18:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:04] bblack: ^^^ FYI [18:21:06] (03CR) 10jenkins-bot: Fix a typo in ORES models config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449756 (owner: 10Sbisson) [18:21:18] !log ladsgroup@mwmaint1001:~$ mwscript extensions/ORES/maintenance/PurgeScoreCache.php --wiki=wikidatawiki and --old (T200680) [18:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:45] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) Ping @mobrovac. Have you made any progress on the RESTBase module that will split traffic between the two backend services? [18:21:56] volans: it's not in service yet, but that check is "wrong", we must have a whitelist somewhere that ignores the cache partitions that needs updating... [18:22:29] yeah I thought so, that's why the FYI ;) [18:23:08] any idea where such a whitelist is? 
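The "whitelist" being hunted for above is the ignore regex of the stock monitoring-plugins `check_disk`: its `-I`/`--ignore-eregi-path` option skips mount points whose path matches, so a cache partition that is expected to run nearly full stops alerting. A sketch (thresholds and the mount list are illustrative, not the production check definition):

```shell
# Hypothetical check_disk invocation; -I is the plugin's
# case-insensitive ignore regex for mount paths:
#   check_disk -w 10% -c 5% -I '^/srv/nvme'
# Applying the same regex to a mount list shows what gets excluded:
printf '%s\n' / /srv /srv/nvme0n1p1 | grep -Ev '^/srv/nvme'
```

Only `/srv/nvme0n1p1` is filtered out, which matches the cp1075 disk-space recovery once the exclusion landed.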
[18:23:51] ah I found it
[18:24:10] ack
[18:24:24] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Pchelolo) >>! In T186748#4473732, @phuedx wrote: > Ping @mobrovac. Have you made any progress on the RESTBase module that will split traff...
[18:25:02] (03PS7) 10Bstorm: WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791
[18:25:43] (03CR) 10jerkins-bot: [V: 04-1] WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 (owner: 10Bstorm)
[18:25:57] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) :thumbsup: Noted. Thanks, @Pchelolo!
[18:28:01] (03PS8) 10Bstorm: WIP tooforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791
[18:31:19] (03PS1) 10BBlack: base check_disk_options: exclude /srv/nvme mounts [puppet] - 10https://gerrit.wikimedia.org/r/450073 (https://phabricator.wikimedia.org/T195923)
[18:32:57] (03CR) 10BBlack: [C: 032] base check_disk_options: exclude /srv/nvme mounts [puppet] - 10https://gerrit.wikimedia.org/r/450073 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack)
[18:34:36] RECOVERY - Disk space on cp1075 is OK: DISK OK
[18:38:52] (03CR) 10Dzahn: [C: 032] posgresql::backup: fix "Unterminated quoted string" in cron tab [puppet] - 10https://gerrit.wikimedia.org/r/450061 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn)
[18:39:09] (03PS3) 10Dzahn: posgresql::backup: fix "Unterminated quoted string" in cron tab [puppet] - 10https://gerrit.wikimedia.org/r/450061 (https://phabricator.wikimedia.org/T190184)
[18:39:56] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0
[18:41:34] (03CR) 10Ottomata: [C: 031] Import upstream version 2.2.3 [debs/archiva] - 10https://gerrit.wikimedia.org/r/449755 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey)
[18:42:15] (03CR) 10Dzahn: [V: 032 C: 032] posgresql::backup: fix "Unterminated quoted string" in cron tab [puppet] - 10https://gerrit.wikimedia.org/r/450061 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn)
[18:46:17] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[18:48:46] PROBLEM - Device not healthy -SMART- on cp1075 is CRITICAL: cluster=cache_text device=nvme0n1 instance=cp1075:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1075&var-datasource=eqiad%2520prometheus%252Fops
[18:53:45] (03PS2) 10Dzahn: etcd: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449891
[18:55:40] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) >>! In T200215#4471773, @mobrovac wrote: > I assume the task description implies t...
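Editor's note: the merged change 450073 ("base check_disk_options: exclude /srv/nvme mounts") is not quoted in the log, but the standard Monitoring Plugins check_disk supports regex-based mount exclusion, so the fix plausibly amounts to a command fragment like the following (the flag is real; thresholds and exact regex are illustrative assumptions):

```
# Ignore any filesystem whose mount point matches the regex, so the raw
# cache partition under /srv/nvme* stops alerting before it is in service.
# Thresholds here are illustrative, not the production values.
check_disk -w 6% -c 3% -l --ignore-eregi-path='^/srv/nvme'
```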
[18:56:25] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629 (10matmarex)
[18:58:22] (03PS1) 10Thcipriani: Beta: deployment-deploy02 is deployment host [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561)
[18:58:24] (03PS1) 10Thcipriani: Beta: remove deployment-{tin,mira} [puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561)
[18:59:16] (03CR) 10Thcipriani: [C: 04-1] "Will manually apply shortly" [puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani)
[18:59:29] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) @EBernhardson, do you have an idea of how large your individual messages will be? I k...
[19:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1900).
[19:01:53] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) >>! In T200215#4474031, @Ottomata wrote: > @EBernhardson, do you have an idea of h...
[19:02:41] (03CR) 10Elukey: [C: 032] Import upstream version 2.2.3 [debs/archiva] - 10https://gerrit.wikimedia.org/r/449755 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey)
[19:07:52] !log T191061 is free of blockers, proceeding with the train deployment: group2 wikis to 1.32.0-wmf.15
[19:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:56] T191061: 1.32.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T191061
[19:08:00] (03PS1) 10Elukey: Release 2.2.3-1 [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639)
[19:08:25] (03PS1) 1020after4: all wikis to 1.32.0-wmf.15 refs T191061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450082
[19:08:27] (03CR) 1020after4: [C: 032] all wikis to 1.32.0-wmf.15 refs T191061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450082 (owner: 1020after4)
[19:08:42] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mholloway)
[19:09:58] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.15 refs T191061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450082 (owner: 1020after4)
[19:10:47] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.15 refs T191061
[19:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:56] 10Operations, 10ops-eqiad, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson)
[19:17:59] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10Pchelolo) https://github.com/wikimedia/restbase/pull/1043
[19:18:48] 10Operations, 10ops-eqiad, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson) a:05Cmjohnson>03RobH This is ready for install, assigning to @robh for help.
[19:19:17] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10Cmjohnson)
[19:19:19] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Mholloway) Now that the Beta Cluster is back in good working order with both Jessie (deployment-maps03) and Stretch (deployment-m...
[19:20:08] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10Cmjohnson) a:05Cmjohnson>03RobH This servers is ready for install, assigning to @robh for help with installation
[19:21:23] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10Cmjohnson) these servers are ready for install, assigning to @robh for help.
[19:22:57] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.15 refs T191061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450082 (owner: 1020after4)
[19:23:13] (03CR) 10Ottomata: [C: 031] "woohoooo" [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey)
[19:24:55] (03CR) 10Chad: [C: 031] "I love you." [debs/archiva] (debian) - 10https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639) (owner: 10Elukey)
[19:28:10] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Cmjohnson)
[19:29:29] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Cmjohnson) a:05Andrew>03RobH These servers are racked and cabled to both NICS eth0 is in cloud-hosts vlan eth1 is in cloud-ins...
[19:31:21] (03CR) 10Dzahn: [C: 032] etcd: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449891 (owner: 10Dzahn)
[19:32:31] Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.25)
[19:32:43] saw a few of these in fatalmonitor ...
[19:32:53] (03PS9) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791
[19:33:16] Query: SELECT user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount FROM `user` WHERE user_id = X LIMIT 1 FOR UPDATE
[19:35:55] (03PS1) 10BBlack: cp1075-90: add rest of macaddrs [puppet] - 10https://gerrit.wikimedia.org/r/450087 (https://phabricator.wikimedia.org/T195923)
[19:37:40] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10RobH) I don't see any mention of what OS to use, however nearly all of cloud is on jessie. Additionally, it seems...
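Editor's note: the "Error: 1205 Lock wait timeout exceeded" fatals quoted above are MySQL telling the client the row lock (here taken by `SELECT ... FOR UPDATE`) could not be acquired in time, and that the transaction should be retried. A minimal, generic retry wrapper looks like the sketch below; the function name, exception shape, and backoff values are illustrative only and are not MediaWiki's actual Rdbms handling:

```python
import time

def run_transaction(txn, attempts=3, backoff=0.05):
    """Retry a transactional callable when MySQL reports error 1205
    (lock wait timeout). `txn` is any callable that runs one transaction
    and raises an exception whose message carries the MySQL error code;
    both are illustrative assumptions, not MediaWiki's real API."""
    for attempt in range(attempts):
        try:
            return txn()
        except RuntimeError as exc:
            # Only retry lock-wait timeouts, and only while attempts remain
            if "1205" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))  # exponential backoff before retrying
```

The key points are that 1205 is a retryable error (unlike most query errors) and that each retry must re-run the whole transaction, not just the failed statement.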
[19:37:52] twentyafterfour: I'll have a patch in a minute
[19:38:20] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10RobH)
[19:38:55] legoktm: thanks
[19:39:24] twentyafterfour: https://gerrit.wikimedia.org/r/450088
[19:41:35] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Cmjohnson)
[19:41:43] legoktm: cherry-picked to wmf.15 https://gerrit.wikimedia.org/r/450090/
[19:41:50] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Cmjohnson)
[19:42:31] twentyafterfour: lgtm
[19:42:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Cmjohnson) a:05Cmjohnson>03RobH This is ready to be installed, assigning to @robh for help finishing the installation.
[19:42:51] thanks legoktm! deploying as soon as I can get it through CI
[19:43:05] (03CR) 10BBlack: [C: 032] cp1075-90: add rest of macaddrs [puppet] - 10https://gerrit.wikimedia.org/r/450087 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack)
[19:43:28] ugh: 20 min remaining for gate-and-submit-swat
[19:43:55] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10Cmjohnson)
[19:44:02] I think the jobs themselves are that slow
[19:44:35] 10Operations, 10ContentTranslation, 10Language-Team, 10WorkType-Maintenance: Apertium leaves a ton of stale processes, consumes all the available memory - https://phabricator.wikimedia.org/T107270 (10Petar.petkovic)
[19:44:45] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10Cmjohnson) a:03RobH Assigning to @robh to help with final stage of installation.
[19:44:46] yeah the swat queue isn't backed up... I don't understand having tests run that long, surely there's a better way
[19:47:40] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1077.eqiad.wmnet', 'cp1078.eqiad.wmnet', 'cp1079.eqiad.wmn...
[19:54:11] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10Cmjohnson)
[19:54:17] (03PS1) 10RobH: setup of cloudelastic100[1-4].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/450092 (https://phabricator.wikimedia.org/T194186)
[19:55:04] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10Cmjohnson) This is racked but I am getting a link issue and will need to check the cable.
[19:55:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10Cmjohnson)
[19:56:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to @robh to help complete the installation.
[19:56:29] 10Operations, 10ops-eqiad, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Cmjohnson)
[19:56:57] 10Operations, 10ops-eqiad, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to @robh to help complete the install
[19:57:16] (03PS2) 10Cmjohnson: Adding mgmt dns for 2 spare servers [dns] - 10https://gerrit.wikimedia.org/r/450057 (https://phabricator.wikimedia.org/T196697)
[19:57:28] (03CR) 10RobH: [C: 032] setup of cloudelastic100[1-4].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/450092 (https://phabricator.wikimedia.org/T194186) (owner: 10RobH)
[19:57:37] (03PS2) 10RobH: setup of cloudelastic100[1-4].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/450092 (https://phabricator.wikimedia.org/T194186)
[20:01:43] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for 2 spare servers [dns] - 10https://gerrit.wikimedia.org/r/450057 (https://phabricator.wikimedia.org/T196697) (owner: 10Cmjohnson)
[20:05:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0
[20:09:39] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:10] PROBLEM - Host cp1079 is DOWN: PING CRITICAL - Packet loss = 100%
[20:10:10] PROBLEM - Host cp1081 is DOWN: PING CRITICAL - Packet loss = 100%
[20:10:29] RECOVERY - Host cp1079 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms
[20:10:52] ignore those and related cp10xx alerts for now
[20:10:56] I think reimage failed to set downtimes
[20:10:59] RECOVERY - Host cp1081 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[20:11:00] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100%
[20:11:39] RECOVERY - Host cp1077 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:12:59] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3122: Connection refused
[20:12:59] PROBLEM - confd service on cp1079 is CRITICAL: NRPE: Command check_confd-state not defined
[20:12:59] PROBLEM - HTTPS Unified ECDSA on cp1079 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:13:10] PROBLEM - Webrequests Varnishkafka log producer on cp1079 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined
[20:13:12] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1081 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not defined
[20:13:30] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1081 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined
[20:13:31] PROBLEM - eventlogging Varnishkafka log producer on cp1081 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined
[20:13:50] PROBLEM - HTTPS Unified ECDSA on cp1077 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:14:00] PROBLEM - statsv Varnishkafka log producer on cp1081 is CRITICAL: NRPE: Command check_varnishkafka-statsv not defined
[20:14:01] PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:14:01] PROBLEM - HTTPS Unified RSA on cp1077 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:14:19] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3121: Connection refused
[20:14:19] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 3126: Connection refused
[20:14:20] PROBLEM - confd service on cp1077 is CRITICAL: NRPE: Command check_confd-state not defined
[20:14:39] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3120: Connection refused
[20:14:39] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 3125: Connection refused
[20:14:40] PROBLEM - Webrequests Varnishkafka log producer on cp1077 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined
[20:14:50] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3124: Connection refused
[20:14:50] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3123: Connection refused
[20:14:50] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 80: Connection refused
[20:14:50] PROBLEM - Check systemd state on cp1079 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:14:50] PROBLEM - HTTPS Unified RSA on cp1079 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:15:20] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3122: Connection refused
[20:16:09] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3123: Connection refused
[20:16:10] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 3127: Connection refused
[20:16:21] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to @robh for help with the final installation
[20:16:31] (03PS1) 10RobH: updating netboot.cfg for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/450096 (https://phabricator.wikimedia.org/T194186)
[20:16:39] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3125: Connection refused
[20:16:40] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3124: Connection refused
[20:16:40] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1077 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined
[20:16:40] PROBLEM - eventlogging Varnishkafka log producer on cp1077 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined
[20:16:56] (03CR) 10RobH: [C: 032] updating netboot.cfg for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/450096 (https://phabricator.wikimedia.org/T194186) (owner: 10RobH)
[20:17:08] (03CR) 10Dduvall: [C: 031] "I pinged the folks in #jenkins IRC and the author of the shared pipeline library plugin confirmed that the master workspace is used for ch" [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall)
[20:18:00] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:18:20] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3126: Connection refused
[20:18:20] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3125: Connection refused
[20:18:20] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1077 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not defined
[20:18:20] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1079 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined
[20:18:20] PROBLEM - eventlogging Varnishkafka log producer on cp1079 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined
[20:19:38] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) I ran a slightly longer test using 5M records, this allows it to run long enough t...
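Editor's note: the T200215 thread above is weighing per-message size and burstiness for a bulk Kafka producer. That tradeoff is usually expressed through standard Kafka producer properties; the property names below are real Kafka producer configs, but the values are purely illustrative and not anything decided in the ticket:

```
# Hypothetical producer tuning for a bursty bulk pipeline (values illustrative).
max.request.size=4194304     # upper bound on one produce request, must fit broker limits
batch.size=1048576           # larger batches favor bulk throughput over latency
linger.ms=50                 # wait briefly so batches actually fill
compression.type=snappy     # shrink bulk payloads on the wire
```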
[20:20:10] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 3127: Connection refused
[20:20:10] PROBLEM - statsv Varnishkafka log producer on cp1077 is CRITICAL: NRPE: Command check_varnishkafka-statsv not defined
[20:20:11] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3126: Connection refused
[20:20:11] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 3128: Connection refused
[20:20:11] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1079 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not defined
[20:20:58] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Knowledge Integrity program - https://phabricator.wikimedia.org/T200924 (10herron) a:03herron
[20:21:00] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:21:20] RECOVERY - HTTPS Unified RSA on cp1079 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345580 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days)
[20:21:30] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 seconds ago with 3 failures. Failed resources (up to 3 shown): Service[varnishmtail],Exec[retry-load-new-vcl-file],Service[varnish-frontend]
[20:21:39] RECOVERY - HTTPS Unified ECDSA on cp1077 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345565 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days)
[20:21:40] RECOVERY - HTTPS Unified ECDSA on cp1079 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345560 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days)
[20:21:49] RECOVERY - HTTPS Unified RSA on cp1077 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345560 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days)
[20:21:50] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1077 is CRITICAL: connect to address 10.64.0.132 and port 80: Connection refused
[20:21:50] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1079 is CRITICAL: connect to address 10.64.16.22 and port 3127: Connection refused
[20:21:50] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: connect to address 10.64.16.24 and port 3120: Connection refused
[20:21:50] PROBLEM - statsv Varnishkafka log producer on cp1079 is CRITICAL: NRPE: Command check_varnishkafka-statsv not defined
[20:22:50] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:22:50] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:22:53] jouncebot: now
[20:22:53] For the next 0 hour(s) and 37 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T1900)
[20:22:56] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) Hm ok. We can probably handle that in main-eqiad, but it would be very bursty and dom...
[20:23:10] RECOVERY - Webrequests Varnishkafka log producer on cp1079 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
[20:23:11] RECOVERY - confd service on cp1077 is OK: OK - confd is active
[20:23:11] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1081 is OK: No errors detected
[20:23:19] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time
[20:23:19] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:23:19] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.000 second response time
[20:23:20] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1077 is OK: No errors detected
[20:23:20] RECOVERY - eventlogging Varnishkafka log producer on cp1077 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf
[20:23:28] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.15/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: sync https://gerrit.wikimedia.org/r/450090/ to unbreak prod refs T201083 T191061 (duration: 00m 49s)
[20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:34] T201083: InvalidArgumentException from line 58 of includes/libs/EasyDeflate.php: Data does not begin with deflated prefix - https://phabricator.wikimedia.org/T201083
[20:23:34] T191061: 1.32.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T191061
[20:23:39] RECOVERY - statsv Varnishkafka log producer on cp1077 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[20:23:40] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1081 is OK: No errors detected
[20:23:40] RECOVERY - eventlogging Varnishkafka log producer on cp1081 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf
[20:23:42] RECOVERY - Webrequests Varnishkafka log producer on cp1077 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
[20:23:43] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1079 is OK: No errors detected
[20:23:44] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:23:44] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.002 second response time
[20:24:09] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:24:09] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time
[20:24:09] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1077 is OK: No errors detected
[20:24:09] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1079 is OK: No errors detected
[20:24:09] RECOVERY - eventlogging Varnishkafka log producer on cp1079 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf
[20:24:11] RECOVERY - confd service on cp1079 is OK: OK - confd is active
[20:24:11] RECOVERY - statsv Varnishkafka log producer on cp1081 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[20:24:19] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time
[20:24:19] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:24:19] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time
[20:24:19] RECOVERY - statsv Varnishkafka log producer on cp1079 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[20:24:40] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time
[20:24:40] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:24:50] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time
[20:25:00] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.002 second response time
[20:25:09] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:25:09] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:25:09] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.000 second response time
[20:25:20] PROBLEM - puppet last run on cp1079 is CRITICAL: Return code of 255 is out of bounds
[20:26:20] PROBLEM - Host cp1081 is DOWN: PING CRITICAL - Packet loss = 100%
[20:26:20] PROBLEM - Host cp1079 is DOWN: PING CRITICAL - Packet loss = 100%
[20:26:39] PROBLEM - Host cp1077 is DOWN: PING CRITICAL - Packet loss = 100%
[20:26:59] twentyafterfour: will I be stepping on your toes if I run a scap3 deploy?
[20:27:12] bd808: nope
[20:27:18] sweet
[20:28:09] (03PS1) 10RobH: fixing new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/450144 (https://phabricator.wikimedia.org/T194186)
[20:28:09] RECOVERY - Host cp1077 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[20:28:14] 10Operations, 10Wikimedia-Mailing-lists: Mailing list for Knowledge Integrity program - https://phabricator.wikimedia.org/T200924 (10herron) 05Open>03Resolved Hi @Samwalton9, the list `knowledgeintegrity@lists.wikimedia.org` has been created and the mailing list system should have sent you the credentials...
[20:28:19] RECOVERY - Host cp1079 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[20:28:20] RECOVERY - Host cp1081 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[20:28:30] RECOVERY - Check systemd state on cp1079 is OK: OK - running: The system is fully operational
[20:28:39] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 497 bytes in 0.001 second response time
[20:28:40] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:28:40] !log bd808@deploy1001 Started deploy [striker/deploy@2329901]: Update Striker to 2329901 (T177407, T198076, T190543)
[20:28:45] (03CR) 10RobH: [C: 032] fixing new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/450144 (https://phabricator.wikimedia.org/T194186) (owner: 10RobH)
[20:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:48] T177407: https://toolsadmin.wikimedia.org/tools/create/ returns 403 rather than redirecting to login - https://phabricator.wikimedia.org/T177407
[20:28:48] T190543: Update UI to use term "Wikimedia developer account" - https://phabricator.wikimedia.org/T190543
[20:28:49] T198076: CI tests failing for labs/striker due to "upstream library changes and loose specification of the dependencies in the requirements.txt file" - https://phabricator.wikimedia.org/T198076
[20:28:49] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:28:50] RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational
[20:29:10] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time
[20:30:06] !log bd808@deploy1001 Finished deploy [striker/deploy@2329901]: Update Striker to 2329901 (T177407, T198076, T190543) (duration: 01m 26s)
[20:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:33] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[20:31:42] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:32:37] 10Operations, 10ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10ayounsi) p:05Triage>03High
[20:33:38] 10Operations, 10vm-requests: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (10herron) a:03herron
[20:34:43] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:35:13] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:35:32] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6
[20:37:58] 10Operations, 10netops: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (10ayounsi) p:05Triage>03Normal
[20:38:45] 10Operations, 10ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10ayounsi)
[20:38:45] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10ayounsi)
[20:47:20] (03PS1) 10BBlack: smart-data-dump: support nvme [puppet] - 10https://gerrit.wikimedia.org/r/450148 (https://phabricator.wikimedia.org/T195923)
[20:47:22] (03CR) 10Thcipriani: [C: 031] "looks good to me, will require a jenkins restart" [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall)
[20:48:58] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1080.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['cp1080.eqiad.wmnet'] ```
[20:51:08] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1083.eqiad.wmnet', 'cp1084.eqiad.wmnet', 'cp1085.eqiad.wmn...
[20:52:02] (03CR) 10BBlack: [C: 032] smart-data-dump: support nvme [puppet] - 10https://gerrit.wikimedia.org/r/450148 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack)
[20:54:39] PROBLEM - Device not healthy -SMART- on cp1081 is CRITICAL: cluster=cache_text device=nvme0n1 instance=cp1081:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1081&var-datasource=eqiad%2520prometheus%252Fops
[21:03:32] PROBLEM - Device not healthy -SMART- on cp1079 is CRITICAL: cluster=cache_text device=nvme0n1 instance=cp1079:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1079&var-datasource=eqiad%2520prometheus%252Fops
[21:06:09] the smart dump output is actually correct and healthy now
[21:06:32] but it involved changing the name of the device, so now it still has a stuck/stale state of unhealthy for the old name :P
[21:07:06] 08Warning Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Inbound interface errors
[21:12:38] PROBLEM - Host cp1086 is DOWN: PING CRITICAL - Packet loss = 100%
[21:13:08] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[21:13:17] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:13:48] PROBLEM - Host cp1084 is DOWN: PING CRITICAL - Packet loss = 100%
[21:13:48] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100%
[21:13:57] RECOVERY - Host cp1086 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[21:14:17] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[21:14:18] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[21:14:38] RECOVERY - Host cp1084 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[21:14:38] PROBLEM - Host cp1090 is DOWN: PING CRITICAL - Packet loss = 100%
[21:14:48] PROBLEM - Host cp1083 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:17] RECOVERY - Host cp1090 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[21:15:47] RECOVERY - Host cp1083 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[21:15:48] PROBLEM - HTTPS Unified RSA on cp1084 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[21:16:23] yeah apparently there's some race conditions in the downtime stuff :)
[21:16:37] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1086 is CRITICAL: connect to address 10.64.32.70 and port 3128: Connection refused
[21:16:38] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 3125: Connection refused
[21:16:38] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1087 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined
[21:16:38] PROBLEM - Varnish traffic logger - varnishmedia on cp1086 is CRITICAL: NRPE: Command check_varnishmedia not defined
[21:16:38] PROBLEM - eventlogging Varnishkafka log producer on cp1087 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined
[21:16:40] PROBLEM - HTTPS Unified
RSA on cp1087 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:17:17] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:17:17] PROBLEM - confd service on cp1084 is CRITICAL: NRPE: Command check_confd-state not defined [21:17:17] PROBLEM - Check systemd state on cp1084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:17:38] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1084 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined [21:17:57] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 3124: Connection refused [21:17:58] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 3124: Connection refused [21:18:17] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 3123: Connection refused [21:18:17] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 3123: Connection refused [21:18:17] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1088 is CRITICAL: connect to address 10.64.48.102 and port 80: Connection refused [21:18:28] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1083 is CRITICAL: connect to address 10.64.32.67 and port 3121: Connection refused [21:18:28] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 3126: Connection refused [21:18:28] PROBLEM - Webrequests Varnishkafka log producer on cp1083 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined [21:18:29] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1084 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not 
defined [21:18:29] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1086 is CRITICAL: connect to address 10.64.32.70 and port 3120: Connection refused [21:18:29] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 3126: Connection refused [21:18:29] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1088 is CRITICAL: connect to address 10.64.48.102 and port 3128: Connection refused [21:18:29] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1087 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not defined [21:18:30] PROBLEM - Varnish traffic logger - varnishmedia on cp1088 is CRITICAL: NRPE: Command check_varnishmedia not defined [21:18:58] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1083 is CRITICAL: connect to address 10.64.32.67 and port 3128: Connection refused [21:18:58] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1090 is CRITICAL: connect to address 10.64.48.104 and port 80: Connection refused [21:19:47] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1083 is CRITICAL: connect to address 10.64.32.67 and port 3120: Connection refused [21:19:48] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 3125: Connection refused [21:20:18] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1083 is CRITICAL: connect to address 10.64.32.67 and port 3122: Connection refused [21:20:18] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 3127: Connection refused [21:20:18] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1086 is CRITICAL: connect to address 10.64.32.70 and port 3121: Connection refused [21:20:18] PROBLEM - confd service on cp1083 is CRITICAL: NRPE: Command check_confd-state not defined [21:20:18] PROBLEM - HTTPS Unified ECDSA on cp1083 is CRITICAL: SSL CRITICAL - failed to connect or 
SSL handshake:Connection refused [21:20:18] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 3127: Connection refused [21:20:18] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1088 is CRITICAL: connect to address 10.64.48.102 and port 3120: Connection refused [21:20:19] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1090 is CRITICAL: connect to address 10.64.48.104 and port 3128: Connection refused [21:20:19] PROBLEM - statsv Varnishkafka log producer on cp1087 is CRITICAL: NRPE: Command check_varnishkafka-statsv not defined [21:20:20] PROBLEM - Varnish traffic logger - varnishmedia on cp1090 is CRITICAL: NRPE: Command check_varnishmedia not defined [21:20:57] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) @Marostegui, essentially, we need JADE things to [be wiki pages](https://www.mediawiki.org/wiki/Everything_... 
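[Editor's note] The "Device not healthy -SMART-" alerts on cp1079/cp1081 illustrate the stale-metric problem described in the chat above: once the exporter starts reporting the disk under its new device label (nvme0n1), the series for the old label stops updating but keeps its last unhealthy value until it ages out. A minimal Python sketch of that behaviour (the sample structure and staleness window are illustrative, not the actual Prometheus implementation):

```python
import time

# Illustrative in-memory store of per-device health samples:
# {device_label: (healthy, last_update_timestamp)}
samples = {}

def report(device, healthy, now):
    """Exporter writes a fresh sample for a device label."""
    samples[device] = (healthy, now)

def prune_stale(now, max_age=300):
    """Drop series that have not been updated within max_age seconds."""
    for device in [d for d, (_, ts) in samples.items() if now - ts > max_age]:
        del samples[device]

t = time.time()
report("sda", False, t)           # old name, last reported unhealthy
report("nvme0n1", True, t + 200)  # renamed device now reports healthy

# Until the old series is pruned, a checker still sees a stuck
# unhealthy "sda" entry even though the hardware is fine.
assert samples["sda"] == (False, t)
prune_stale(t + 400)
assert "sda" not in samples and "nvme0n1" in samples
```

This matches why the RECOVERY messages for cp1075/cp1079/cp1081 arrive only later: the old-name series has to go stale before the check reads "All metrics within thresholds."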
[21:21:07] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 [21:21:18] RECOVERY - HTTPS Unified RSA on cp1084 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345571 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days) [21:21:28] RECOVERY - HTTPS Unified ECDSA on cp1083 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345562 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days) [21:21:47] RECOVERY - confd service on cp1084 is OK: OK - confd is active [21:21:48] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1084 is OK: No errors detected [21:21:48] RECOVERY - Varnish traffic logger - varnishmedia on cp1088 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [21:22:07] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: connect to address 10.64.32.67 and port 3123: Connection refused [21:22:07] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp1084 is CRITICAL: connect to address 10.64.32.68 and port 80: Connection refused [21:22:07] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1086 is CRITICAL: connect to address 10.64.32.70 and port 3122: Connection refused [21:22:07] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: connect to address 10.64.48.101 and port 80: Connection refused [21:22:07] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1088 is CRITICAL: connect to address 10.64.48.102 and port 3121: Connection refused [21:22:08] PROBLEM - Webrequests Varnishkafka log producer on cp1086 is CRITICAL: NRPE: Command check_varnishkafka-webrequest not defined [21:22:09] PROBLEM - Check systemd state on cp1083 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. [21:22:09] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1090 is CRITICAL: connect to address 10.64.48.104 and port 3120: Connection refused [21:22:10] PROBLEM - HTTPS Unified RSA on cp1083 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:22:10] PROBLEM - HTTPS Unified ECDSA on cp1086 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:22:11] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1084 is OK: No errors detected [21:22:37] RECOVERY - Varnish traffic logger - varnishmedia on cp1090 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [21:22:48] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:22:57] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:22:57] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [21:22:58] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1087 is OK: No errors detected [21:22:58] RECOVERY - Webrequests Varnishkafka log producer on cp1083 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [21:23:17] RECOVERY - HTTPS Unified RSA on cp1083 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345452 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days) [21:23:18] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1087 is OK: No errors detected [21:23:18] RECOVERY - eventlogging Varnishkafka log producer on cp1087 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf 
[21:23:27] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:23:27] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.000 second response time [21:23:27] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time [21:23:27] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1088 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:23:27] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.001 second response time [21:23:28] RECOVERY - HTTPS Unified RSA on cp1087 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345450 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days) [21:23:47] RECOVERY - confd service on cp1083 is OK: OK - confd is active [21:23:48] RECOVERY - statsv Varnishkafka log producer on cp1087 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf [21:23:57] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp1086 is CRITICAL: connect to address 10.64.32.70 and port 3123: Connection refused [21:23:57] PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
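[Editor's note] The burst of PROBLEM/RECOVERY notifications around the cp10xx reimages is what the earlier "race conditions in the downtime stuff" remark refers to: if downtime is submitted before the monitoring server has (re)loaded the host object, or expires while the host is still installing, alerts leak through. For reference, a sketch of the Icinga 1.x external command such tooling submits; in production the line is written to the Icinga command pipe (the conventional pipe path, e.g. /var/lib/icinga/rw/icinga.cmd, is an assumption of this sketch):

```python
import time

def schedule_host_downtime(host, minutes, author, comment, now=None):
    """Build a Nagios/Icinga 1.x SCHEDULE_HOST_DOWNTIME external command line.

    Fields: host;start;end;fixed;trigger_id;duration;author;comment.
    fixed=1 means the downtime covers exactly [start, end]; duration
    only matters for flexible (fixed=0) downtime.
    """
    now = int(now if now is not None else time.time())
    start, end = now, now + minutes * 60
    return "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s" % (
        now, host, start, end, minutes * 60, author, comment)

cmd = schedule_host_downtime("cp1083", 120, "bblack", "reimage", now=1533245000)
assert cmd == ("[1533245000] SCHEDULE_HOST_DOWNTIME;cp1083;1533245000;"
               "1533252200;1;0;7200;bblack;reimage")
```

The race is outside this function: between building the command and Icinga applying it, a puppet run can regenerate the host definitions, dropping the not-yet-applied downtime.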
[21:24:18] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.000 second response time [21:24:18] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time [21:24:28] RECOVERY - Webrequests Varnishkafka log producer on cp1086 is OK: PROCS OK: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [21:24:29] RECOVERY - HTTPS Unified ECDSA on cp1086 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345386 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 111 days) [21:24:32] (03CR) 10Paladox: "This will be done next monday after getting approval by releng." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [21:24:37] RECOVERY - Varnish traffic logger - varnishmedia on cp1086 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [21:24:37] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:24:38] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1088 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:24:38] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time [21:24:47] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.005 second response time [21:24:47] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.003 second response time [21:24:47] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:24:47] RECOVERY 
- Varnish HTTP upload-backend - port 3128 on cp1088 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.000 second response time [21:25:18] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time [21:25:18] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1084 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.000 second response time [21:25:18] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp1086 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:25:18] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.001 second response time [21:25:18] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1088 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:25:18] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.003 second response time [21:25:18] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.001 second response time [21:25:19] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [21:25:28] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.001 second response time [21:25:48] PROBLEM - Check systemd state on cp1088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:25:58] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1086 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.000 second response time [21:26:18] RECOVERY - Check systemd state on cp1086 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause. 
[21:26:31] (03PS1) 10Herron: dns: reserve public IPv4 and set forward/reverse dns for archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/450154 (https://phabricator.wikimedia.org/T200895) [21:26:38] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1086 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:28:28] PROBLEM - Host cp1090 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:35] (03CR) 10Herron: [C: 032] dns: reserve public IPv4 and set forward/reverse dns for archiva1001 [dns] - 10https://gerrit.wikimedia.org/r/450154 (https://phabricator.wikimedia.org/T200895) (owner: 10Herron) [21:28:39] (03PS1) 10RobH: new cloudelastic to use stretch [puppet] - 10https://gerrit.wikimedia.org/r/450155 [21:28:48] PROBLEM - Host cp1084 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:48] PROBLEM - Host cp1086 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:48] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:57] PROBLEM - Host cp1083 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:58] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:37] RECOVERY - Host cp1090 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:29:38] RECOVERY - Device not healthy -SMART- on cp1075 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1075&var-datasource=eqiad%2520prometheus%252Fops [21:29:48] RECOVERY - Host cp1083 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:29:48] RECOVERY - Host cp1084 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [21:29:57] RECOVERY - Host cp1086 is UP: PING WARNING - Packet loss = 58%, RTA = 0.26 ms [21:29:58] RECOVERY - Host cp1087 is UP: PING WARNING - Packet loss = 28%, RTA = 0.14 ms [21:29:58] RECOVERY - Check systemd state on cp1084 is OK: OK - running: The system is fully operational [21:30:26] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp1086 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [21:30:26] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:31:06] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1086 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.000 second response time [21:31:07] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational [21:31:10] (03CR) 10RobH: [C: 032] new cloudelastic to use stretch [puppet] - 10https://gerrit.wikimedia.org/r/450155 (owner: 10RobH) [21:31:44] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) a:05BBlack>03Cmjohnson Most of these are installed now, but 2x have initial hardware issues: * cp1080 - Reports uncorrectably-bad DIMM in slot A5 on bootup... [21:32:07] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 [21:33:51] 10Operations, 10vm-requests, 10Patch-For-Review: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (10herron) I've kicked off the instance creation for this VM just now. Ganeti is estimating a few hours to synchronize the new DRBD devices, so I'll let that run in the background a... 
[21:34:42] XioNoX: ok to remove possible network issues from the channel topic at this point? [21:35:04] herron: yep, thx [21:35:25] cool np [21:35:29] waiting for 4 more ipsec failures to show up so I can ack all the things and clean up [21:35:37] Or replace with impossible network issues [21:35:46] haha [21:36:53] 10Operations, 10vm-requests, 10Patch-For-Review, 10User-herron: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (10herron) p:05Triage>03Normal [21:37:06] Warning (cleared): Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Inbound interface errors [21:39:07] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 [21:42:06] RECOVERY - Check systemd state on cp1083 is OK: OK - running: The system is fully operational [21:42:06] RECOVERY - Check systemd state on cp1088 is OK: OK - running: The system is fully operational [21:47:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10RobH) [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:47:51] ACKNOWLEDGEMENT - IPsec on cp1089
is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 Brandon Black T200806 [21:48:25] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10RobH) a:05RobH>03Gehel @gehel & @EBernhardson: I'm assinging this to @gehel as the SRE team member involved with this project, for se... [21:48:37] 10Operations, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10RobH) [21:50:08] ACKNOWLEDGEMENT - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 84 connecting: cp1080_v4, cp1080_v6 Brandon Black https://phabricator.wikimedia.org/T195923#4474777 [21:50:08] ACKNOWLEDGEMENT - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 84 connecting: cp1080_v4, cp1080_v6 Brandon Black https://phabricator.wikimedia.org/T195923#4474777 [21:50:08] ACKNOWLEDGEMENT - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 84 connecting: cp1080_v4, cp1080_v6 Brandon Black https://phabricator.wikimedia.org/T195923#4474777 [21:50:08] ACKNOWLEDGEMENT - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 84 connecting: cp1080_v4, cp1080_v6 Brandon Black https://phabricator.wikimedia.org/T195923#4474777 [21:50:08] ACKNOWLEDGEMENT - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 84 connecting: cp1080_v4, cp1080_v6 Brandon Black https://phabricator.wikimedia.org/T195923#4474777 [21:50:13] way to go icinga-wm [21:55:06] RECOVERY - Device not healthy -SMART- on cp1081 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1081&var-datasource=eqiad%2520prometheus%252Fops [21:55:30] (03PS1) 10BBlack: late_command: fix for jessie lacking nvme-cli [puppet] - 10https://gerrit.wikimedia.org/r/450156 [21:56:16] (03CR) 10BBlack: [C: 032] late_command: fix for jessie lacking nvme-cli [puppet] - 10https://gerrit.wikimedia.org/r/450156 (owner: 10BBlack) [22:03:35] RECOVERY - Device not healthy -SMART- on cp1079 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp1079&var-datasource=eqiad%2520prometheus%252Fops [22:05:48] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) I've added some rate limiting and tested it set to 1k messages/s: >>! In T200215#4474426, @... [22:06:15] (03PS2) 10Dzahn: elasticsearch: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449892 [22:10:10] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @Krinkle Good answer, both in terms of the information that I now have,... 
[22:11:44] !log importing python3-jsonlogger_0.1.9 into stretch-wikimedia [22:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:45] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [22:20:36] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational [22:21:05] !log webperf2002 - systemctl start proc-sys-fs-binfmt_misc.automount to fix "Check systemd degraded" Icinga check [22:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:23] (03PS1) 10RobH: adding authdns1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/450163 (https://phabricator.wikimedia.org/T196693) [22:26:23] (03CR) 10RobH: [C: 032] adding authdns1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/450163 (https://phabricator.wikimedia.org/T196693) (owner: 10RobH) [22:28:07] (03PS10) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [22:29:53] (03CR) 10Dzahn: [C: 032] elasticsearch: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449892 (owner: 10Dzahn) [22:30:02] (03PS3) 10Dzahn: elasticsearch: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449892 [22:33:51] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10RobH) [22:42:35] (03CR) 10Alex Monk: "I already uploaded this commit" [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:44:26] !log replacing package python3-jsonlogger with python3-json-logger in stretch-wikimedia 
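[Editor's note] The "Check systemd state" results throughout this log ("OK - running", "CRITICAL - degraded", and the transient "unknown" operational state on cp1086) map systemd's overall system state onto Nagios status codes, which is why starting the one failed proc-sys-fs-binfmt_misc.automount unit on webperf2002 was enough to clear the alert. A minimal sketch of that mapping, using the message strings seen in this log (the actual check script differs):

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_systemd_state(state):
    """Map `systemctl is-system-running` output to a plugin result."""
    if state == "running":
        return OK, "OK - running: The system is fully operational"
    if state == "degraded":
        return CRITICAL, ("CRITICAL - degraded: The system is operational "
                          "but one or more units failed.")
    # starting, stopping, maintenance, offline, unknown, ...
    return UNKNOWN, "UNKNOWN - %s: state could not be mapped" % state

assert check_systemd_state("degraded")[0] == CRITICAL
assert check_systemd_state("running")[0] == OK
```

Because any single failed unit flips the whole system to "degraded", the fix is always to find the unit (systemctl --failed) and start or reset it, as done here.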
[22:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:47] (03PS2) 10Dzahn: Beta: deployment-deploy02 is deployment host [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:49:31] (03CR) 10Dzahn: [C: 032] Beta: deployment-deploy02 is deployment host [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:54:13] (03CR) 10Alex Monk: "... really?" [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:57:28] (03CR) 10Thcipriani: "> ... really?" [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:57:37] (03CR) 10Thcipriani: Beta: remove deployment-{tin,mira} [puppet] - 10https://gerrit.wikimedia.org/r/450079 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [22:59:24] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Dzahn) p:05High>03Normal [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180802T2300). [23:00:05] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:05] I'll deploy [23:03:12] !log adding package anycast-healthchecker to stretch-wikimedia [23:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:09] 10Operations, 10Cloud-VPS, 10cloud-services-team: labvirt1009 has high CPU, disk I/O and skyrocketted load - https://phabricator.wikimedia.org/T200888 (10hashar) a:03Andrew Thanks :] [23:05:14] 10Operations, 10Cloud-VPS, 10cloud-services-team: labvirt1009 has high CPU, disk I/O and skyrocketted load - https://phabricator.wikimedia.org/T200888 (10hashar) 05Open>03Resolved [23:07:08] (03CR) 10Hashar: [C: 031] ":]" [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall) [23:07:44] I'm here [23:08:20] MaxSem: I think my patch is the only one? [23:08:29] yup [23:08:41] ok :) [23:15:29] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.53 seconds [23:17:22] MaxSem: is it still in progress or done? [23:17:24] 10Operations, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10RobH) a:05RobH>03None [23:18:06] finally merged... [23:18:17] 10Operations, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10RobH) So this is ready for someone in #traffic to take over, and migrate authoritative dns services from radon.wikimedia.org. Then we can decom old system radon. [23:18:20] ah, cool [23:18:57] SMalyshev: pulled on mwdebug1002 [23:19:05] checking [23:21:04] wow it took a minute to load page from mwdebug... 
it is sloooow [23:23:36] (03CR) 10Dzahn: "no diff: http://puppet-compiler.wmflabs.org/11980/" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:24:21] MaxSem: everything seems to be working as it should be [23:26:00] !log maxsem@deploy1001 Synchronized php-1.32.0-wmf.15/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/450080/ (duration: 00m 49s) [23:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:03] SMalyshev: ^ [23:26:53] thanks! [23:34:03] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Krinkle) For crons of this kind, we tend to use `foreachwiki`, or `mwscriptwikise...
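[Editor's note] Verifying a SWAT patch "pulled on mwdebug1002" before the scap sync, as done above, is typically a matter of routing one's own requests to that debug backend via the X-Wikimedia-Debug header. A sketch of building such a request (the "backend=<host>" value syntax follows the WikimediaDebug tooling convention and should be treated as an assumption here):

```python
def debug_headers(backend="mwdebug1002.eqiad.wmnet"):
    """Headers that pin a request to a specific MediaWiki debug backend.

    NOTE: the exact 'backend=<host>' value format is an assumption of
    this sketch, not taken from the log above.
    """
    return {"X-Wikimedia-Debug": "backend=%s" % backend}

headers = debug_headers()
assert headers["X-Wikimedia-Debug"] == "backend=mwdebug1002.eqiad.wmnet"
```

Usage would look like `curl -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" <url>`; debug backends run with extra instrumentation, which is consistent with the "wow it took a minute to load" remark.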