[00:22:01] (03PS1) 10Andrew Bogott: realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 [00:22:39] (03CR) 10jerkins-bot: [V: 04-1] realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 (owner: 10Andrew Bogott) [00:23:28] (03PS2) 10Andrew Bogott: realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 [00:25:25] (03CR) 10Andrew Bogott: "I'm sure we need this but I'd appreciate it if someone would double-check these IP ranges and regexps." [puppet] - 10https://gerrit.wikimedia.org/r/445045 (owner: 10Andrew Bogott) [00:27:17] Reedy: thcipriani: Deploying or planning to deploy any MW code in the next hour? [00:27:25] nope [00:27:31] Was gonna roll out a few config cleanups if not. [00:28:44] * Krinkle staging on deploy1001/mwdebug1002 [00:29:41] (03PS1) 10Nuria: Revert "role::common::aqs: update druid mediawiki's datasource" [puppet] - 10https://gerrit.wikimedia.org/r/445046 [00:29:55] (03CR) 10Krinkle: [C: 032] Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [00:31:18] (03Merged) 10jenkins-bot: Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [00:31:35] (03CR) 10jenkins-bot: Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [00:34:27] PROBLEM - nutcracker process on scb2006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:34:57] PROBLEM - Check size of conntrack table on scb2006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [00:35:06] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I1aad80c3bcb - Remove setting of wgLocalTZOffset (duration: 00m 57s) [00:35:07] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain} [00:35:07] In the News content for unsupported language (with aggregated=true)) timed out before a response was received [00:35:08] PROBLEM - pdfrender on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:17] PROBLEM - SSH on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:18] PROBLEM - apertium apy on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:18] PROBLEM - eventstreams on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:27] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
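[Editor's note: the realm patch at 00:22 asks reviewers to double-check the new Neutron cloud VM IP ranges and regexes. A minimal sketch of how such a cross-check can be scripted follows; the CIDR and the candidate pattern are invented placeholders, not the values proposed in change 445045.]

import ipaddress
import re

# Placeholder inputs: substitute the real Neutron range and the realm
# regex under review (change 445045) before drawing any conclusions.
CLOUD_RANGE = ipaddress.ip_network("172.16.0.0/21")
CANDIDATE_RE = re.compile(r"^172\.16\.[0-7]\.\d{1,3}$")

def cross_check():
    """Report in-range addresses the regex misses, and a few addresses
    outside the range that it wrongly matches."""
    misses = [str(ip) for ip in CLOUD_RANGE.hosts() if not CANDIDATE_RE.match(str(ip))]
    outside = ["172.16.8.1", "172.15.255.254", "10.68.16.2"]
    false_hits = [ip for ip in outside if CANDIDATE_RE.match(ip)]
    return misses, false_hits

if __name__ == "__main__":
    misses, false_hits = cross_check()
    print("in-range addresses not matched:", misses[:5])
    print("out-of-range addresses matched:", false_hits)
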
[00:35:30] (03CR) 10Krinkle: [C: 032] "Before/After:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [00:35:35] (03CR) 10Krinkle: [C: 032] Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 (owner: 10Krinkle) [00:35:37] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:35:37] RECOVERY - nutcracker process on scb2006 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [00:35:58] RECOVERY - Check size of conntrack table on scb2006 is OK: OK: nf_conntrack is 5 % full [00:36:07] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [00:36:08] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.078 second response time [00:36:08] RECOVERY - SSH on scb2006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [00:36:17] RECOVERY - apertium apy on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.084 second response time [00:36:17] RECOVERY - eventstreams on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.110 second response time [00:36:28] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [00:37:23] (03Merged) 10jenkins-bot: Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 (owner: 10Krinkle) [00:40:25] (03CR) 10Krinkle: [C: 032] Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [00:40:31] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Ia017dea257d - Clean up wgLocaltimezone (duration: 00m 56s) [00:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:34] (03CR) 10jerkins-bot: [V: 04-1] Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [00:40:38] (03CR) 10jenkins-bot: Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 (owner: 10Krinkle) [00:40:46] (03PS2) 10Krinkle: Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 [00:40:59] (03CR) 10Krinkle: [C: 032] Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [00:42:34] (03Merged) 10jenkins-bot: Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [00:44:27] (03CR) 10jenkins-bot: Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [00:44:46] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I3d02da810 - Clean up wgLocaltimezone (duration: 00m 56s) [00:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:57] (03CR) 10Krinkle: Hygiene: remove unsued MFForceSecureLogin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444591 (owner: 10Pmiazga) [00:45:09] (03PS2) 10Krinkle: Remove unused $tmarray variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966) [00:45:18] (03CR) 10Krinkle: [C: 032] Remove unused $tmarray variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [00:47:02] (03Merged) 10jenkins-bot: Remove unused $tmarray variables [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [00:49:07] (03CR) 10jenkins-bot: Remove unused $tmarray variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [00:59:24] 10Operations, 10Wikispeech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072 (10bd808) Deployment via containers to our production Kubernetes cluster is probably an option. The containers themselves will need to be built using the tooling that #operations and #rel... [01:01:30] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I0c621368d6 - Remove unused tmarray variables (duration: 00m 56s) [01:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:19] (03CR) 10Krinkle: [C: 031] Timeless is enabled everywhere, remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444776 (owner: 10Reedy) [01:12:36] (03PS1) 10Krinkle: Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 [01:13:14] (03CR) 10Krinkle: [C: 032] Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 (owner: 10Krinkle) [01:22:34] (03PS2) 10Krinkle: Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 [01:22:36] (03CR) 10Krinkle: [C: 032] Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 (owner: 10Krinkle) [01:24:12] (03Merged) 10jenkins-bot: Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 (owner: 10Krinkle) [01:27:36] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I423f8fac62 - Remove wgGadgetsCacheType setting (duration: 00m 54s) [01:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:55] (03CR) 10jenkins-bot: Remove wgGadgetsCacheType setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445049 (owner: 10Krinkle) [01:52:36] * Krinkle no longer staging deployment [02:06:52] (03CR) 10Krinkle: "Might need to coordinate with https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/443645/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444775 (owner: 10Reedy) [02:32:24] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 12m 37s) [02:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:22] 10Operations, 10Analytics, 10EventBus, 10JobRunner-Service, and 3 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220 (10Krinkle) [03:05:14] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 15m 35s) [03:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:17] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[create_shapelines-gis-coastlines],Exec[create_shapelines-gis-land_polygons] [03:15:38] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jul 11 03:15:37 UTC 2018 (duration 10m 23s) [03:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:07] PROBLEM - Check whether ferm is active by checking the default input chain on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
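[Editor's note: the scb2006 and scb2002 bursts above show the same failure mode, NRPE checks reporting "Could not complete SSL handshake" or a socket timeout while the host is briefly overloaded, followed by recoveries a minute later. The sketch below only illustrates that transport-level pattern against a plain TLS endpoint; it does not speak the actual NRPE protocol, and the hostname is a placeholder.]

import socket
import ssl

def probe_tls(host, port=5666, timeout=10):
    # Open a TCP connection and attempt a TLS handshake within the
    # timeout, mapping the outcome to the Nagios-style strings seen in
    # the log. Real check_nrpe adds its request/response on top of this.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            with ctx.wrap_socket(sock, server_hostname=host):
                return "OK"
    except socket.timeout:
        return "CRITICAL - Socket timeout after %d seconds" % timeout
    except ssl.SSLError:
        return "CRITICAL - Could not complete SSL handshake"
    except OSError as exc:
        return "CRITICAL - %s" % exc

print(probe_tls("scb2002.example.internal"))
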
[03:21:17] PROBLEM - SSH on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v [03:21:18] the News content for unsupported language (with aggregated=true)) timed out before a response was received [03:21:27] PROBLEM - nutcracker process on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:21:37] PROBLEM - apertium apy on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:37] PROBLEM - eventstreams on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:48] PROBLEM - Disk space on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:21:57] PROBLEM - pdfrender on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:22:27] PROBLEM - MD RAID on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:22:37] RECOVERY - nutcracker process on scb2002 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [03:22:37] RECOVERY - apertium apy on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [03:22:48] RECOVERY - Disk space on scb2002 is OK: DISK OK [03:22:57] RECOVERY - pdfrender on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [03:23:08] RECOVERY - Check whether ferm is active by checking the default input chain on scb2002 is OK: OK ferm input default policy is set [03:23:18] RECOVERY - SSH on scb2002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [03:23:27] RECOVERY - MD RAID on scb2002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [03:23:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [03:23:47] RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.093 second response time [03:27:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 942.29 seconds [03:55:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 250.95 seconds [04:22:45] (03CR) 10Krinkle: "Worth giving a try on beta puppetmaster as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [04:39:51] (03PS1) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [04:40:38] (03CR) 10jerkins-bot: [V: 04-1] webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [04:42:40] (03PS10) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [04:42:42] (03PS8) 10Krinkle: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) [04:42:44] (03PS7) 10Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - 10https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [04:42:46] (03PS7) 10Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - 10https://gerrit.wikimedia.org/r/443762 [04:42:48] (03PS5) 10Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) [04:42:50] (03PS5) 10Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) [04:42:52] (03PS2) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [05:00:12] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445067 (https://phabricator.wikimedia.org/T146591) [05:02:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445067 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:04:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445067 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:05:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 for alter table (duration: 01m 07s) [05:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:34] !log Deploy schema change on db1094 T146591 T197891 T196379 [05:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:39] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:05:39] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:05:40] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:06:48] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on db1094 - T187521 [05:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:51] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [05:07:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445067 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:08:04] (03PS1) 
10Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445068 [05:17:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445068 (owner: 10Marostegui) [05:19:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445068 (owner: 10Marostegui) [05:19:28] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445068 (owner: 10Marostegui) [05:20:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 after alter table (duration: 00m 56s) [05:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:36] !log Deploy schema change on db1090 T146591 T197891 T196379 [05:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:41] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:20:41] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:20:42] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:23:34] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on db1090 - T187521 [05:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:37] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [05:25:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445069 (https://phabricator.wikimedia.org/T146591) [05:28:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445069 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:29:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445069 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:30:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445069 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:30:54] (03PS1) 10Tim Starling: Fix ParserMigration: missing tidy config file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445070 (https://phabricator.wikimedia.org/T199293) [05:31:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 for alter table (duration: 00m 56s) [05:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:13] !log Deploy schema change on db1079 with replication, this will generate lag on s7 labs hosts T146591 T197891 T196379 [05:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:19] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:31:19] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:31:20] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:39:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445074 [05:49:10] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on db1079 with replication, this will 
generate lag on s7 labs hosts - T187521 [05:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:14] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [05:50:08] (03CR) 10Tim Starling: [C: 032] Fix ParserMigration: missing tidy config file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445070 (https://phabricator.wikimedia.org/T199293) (owner: 10Tim Starling) [05:52:00] (03Merged) 10jenkins-bot: Fix ParserMigration: missing tidy config file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445070 (https://phabricator.wikimedia.org/T199293) (owner: 10Tim Starling) [05:52:12] (03CR) 10jenkins-bot: Fix ParserMigration: missing tidy config file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445070 (https://phabricator.wikimedia.org/T199293) (owner: 10Tim Starling) [05:55:07] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: fix logspam, ParserMigration breakage (duration: 00m 57s) [05:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445074 (owner: 10Marostegui) [06:04:09] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445074 (owner: 10Marostegui) [06:05:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 after alter table (duration: 00m 57s) [06:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:01] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445075 (https://phabricator.wikimedia.org/T146591) [06:07:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445074 (owner: 10Marostegui) [06:09:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445075 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [06:10:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445075 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [06:12:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445075 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [06:12:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 for alter table (duration: 00m 57s) [06:12:07] !log Deploy schema change on db1086 T146591 T197891 T196379 [06:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:14] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:12:14] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:12:15] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:12:32] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on db1086 - T187521 [06:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:36] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - 
https://phabricator.wikimedia.org/T187521 [06:12:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445077 [06:23:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445077 (owner: 10Marostegui) [06:24:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445077 (owner: 10Marostegui) [06:26:26] !log Deploy schema change on db1062 (s7 primary master) T146591 T197891 T196379 [06:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:30] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:26:30] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:26:31] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:27:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 after alter table (duration: 00m 55s) [06:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445077 (owner: 10Marostegui) [06:30:30] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on db1062 (s7 primary master) - T187521 [06:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:34] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [06:36:50] (03PS1) 10Elukey: role::cache::canary: move eventlogging vk instance to Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/445078 (https://phabricator.wikimedia.org/T175461) [06:37:12] ema --^ [06:38:45] (03CR) 10Elukey: [C: 032] role::cache::canary: move eventlogging vk instance to Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/445078 (https://phabricator.wikimedia.org/T175461) (owner: 10Elukey) [06:40:55] !log Deploy schema change on s3 codfw master (db2043) with replication, this will generate lag on s3 codfw T146591 T197891 T196379 [06:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:01] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:41:01] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:41:01] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:43:07] elukey: \o/ [06:52:27] (03CR) 10Muehlenhoff: "@Andre: This fell through the cracks as sudo changes need to be proposed via a task using the Ops-Access-Requests project, otherwise they'" [puppet] - 10https://gerrit.wikimedia.org/r/441012 (owner: 10Aklapper) [06:52:39] (03PS2) 10Muehlenhoff: Phab: Allow aklapper to purge user caches [puppet] - 10https://gerrit.wikimedia.org/r/441012 (owner: 10Aklapper) [06:59:01] !log Optimize wbc_entity_usage on bewiki cewiki dawiki hywiki ttwiki on db2043 (s3 codfw master), this will generate lag on s3 codfw - T187521 [06:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:05] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [07:00:58] (03CR) 10Muehlenhoff: [C: 032] Phab: Allow aklapper to purge user caches 
[puppet] - 10https://gerrit.wikimedia.org/r/441012 (owner: 10Aklapper) [07:01:29] (03Abandoned) 10Muehlenhoff: oresweb: Switch to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366811 (owner: 10Muehlenhoff) [07:07:00] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) [07:12:45] (03PS5) 10Elukey: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [07:15:40] !log Deploy schema change on dbstore1002:s3 T146591 T197891 T196379 [07:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:46] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [07:15:46] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [07:15:46] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [07:16:08] (03CR) 10Elukey: [C: 032] Revert "role::common::aqs: update druid mediawiki's datasource" [puppet] - 10https://gerrit.wikimedia.org/r/445046 (owner: 10Nuria) [07:16:19] (03CR) 10Elukey: "Wrong +2 :)" [puppet] - 10https://gerrit.wikimedia.org/r/445046 (owner: 10Nuria) [07:16:33] (03CR) 10Elukey: [C: 032] Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [07:20:39] (03PS2) 10Elukey: Set contact_group to admins for main MirrorMaker alerts [puppet] - 10https://gerrit.wikimedia.org/r/444925 (owner: 10Ottomata) [07:21:02] (03PS1) 10Ema: varnish: start without specifying a VCL file [puppet] - 10https://gerrit.wikimedia.org/r/445081 (https://phabricator.wikimedia.org/T164609) [07:21:34] (03CR) 10Elukey: [C: 032] Set contact_group to admins for main MirrorMaker alerts [puppet] - 10https://gerrit.wikimedia.org/r/444925 (owner: 10Ottomata) [07:28:07] PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:28:19] that's me ^ [07:32:12] RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational [07:39:36] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445083 (https://phabricator.wikimedia.org/T146591) [07:41:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445083 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [07:43:20] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445083 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [07:43:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445083 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [07:44:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 for alter table (duration: 00m 56s) [07:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Apart from several comments inline, my main problem with this patch is that it's very large and yet incomplete, thus it's hard to review a" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [07:50:14] !log Deploy schema change on db1078 T146591 T197891 T196379 [07:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:19] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [07:50:20] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [07:50:20] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [07:51:29] !log Optimize wbc_entity_usage on bewiki cewiki dawiki hywiki ttwiki on dbstore1002:s3, db1078 - T187521 [07:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:32] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [07:52:44] (03PS2) 10Elukey: Revert "role::common::aqs: update druid mediawiki's datasource" [puppet] - 10https://gerrit.wikimedia.org/r/445046 (owner: 10Nuria) [07:53:27] (03CR) 10Elukey: [C: 032] Revert "role::common::aqs: update druid mediawiki's datasource" [puppet] - 10https://gerrit.wikimedia.org/r/445046 (owner: 10Nuria) [07:55:19] 10Operations, 10Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (10MoritzMuehlenhoff) Is there a technical reason to only do it selectively for a few WMCS hosts and not in general? Or in other words, is there any known component (like something in Toolforge or... 
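[Editor's note: the db1094, db1079, db1086, db1078 and db1077 changes above all follow the same cycle: drop the replica from the weighted read pools in wmf-config/db-eqiad.php, sync with scap, run the schema change, then revert the depool. The sketch below models that cycle only schematically; the section name, the other hosts and the weights are invented, and the real db-eqiad.php is PHP, not this Python stand-in.]

# Toy model of a section's pooled replicas and their read weights.
# Everything here except the depool/repool pattern itself is made up.
section_loads = {
    "sX": {
        "db1000": 0,    # master, not used for generic reads
        "db1001": 300,
        "db1094": 300,  # replica about to receive the ALTER TABLE
    }
}

def depool(loads, section, host):
    """Remove a replica from the read pool, remembering its weight."""
    return loads[section].pop(host)

def repool(loads, section, host, weight):
    """Restore the replica once the schema change has finished."""
    loads[section][host] = weight

weight = depool(section_loads, "sX", "db1094")
# ... deploy the config, run the ALTER TABLE on the depooled host ...
repool(section_loads, "sX", "db1094", weight)
print(section_loads)
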
[07:57:09] !log roll restart of aqs on aqs* to rollback the druid config [07:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445090 [07:59:08] 10Operations, 10Wikidata: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC - https://phabricator.wikimedia.org/T198049 (10MoritzMuehlenhoff) p:05Triage>03Normal [07:59:10] 10Operations, 10Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:00:00] 10Operations, 10Goal: Perform a datacenter switchover - https://phabricator.wikimedia.org/T199073 (10jcrespo) [08:00:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445090 (owner: 10Marostegui) [08:02:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445090 (owner: 10Marostegui) [08:02:37] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445090 (owner: 10Marostegui) [08:04:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 after alter table (duration: 00m 57s) [08:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:01] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445092 (https://phabricator.wikimedia.org/T146591) [08:10:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445092 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:12:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445092 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:12:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445092 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:13:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 for alter table (duration: 00m 56s) [08:13:55] !log Deploy schema change on db1077 with replication, this will generate lag on s3 labs hosts T146591 T197891 T196379 [08:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:00] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [08:14:00] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [08:14:00] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [08:14:27] !log Optimize wbc_entity_usage on bewiki cewiki dawiki hywiki ttwiki on db1077 with replication, this will generate lag on s3 labs hosts - T187521 [08:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [08:16:08] (03PS2) 10Ema: varnish: startup process for multiple VCL files [puppet] - 10https://gerrit.wikimedia.org/r/445081 (https://phabricator.wikimedia.org/T164609) [08:16:22] (03PS4) 10Jonas Kress (WMDE): 
Add monthly storage schema for graphite [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) [08:22:50] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove pear packages from MW Application Servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [08:23:04] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445093 [08:25:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445093 (owner: 10Marostegui) [08:26:11] (03PS2) 10Muehlenhoff: Stop using DSA/DSS host keys [puppet] - 10https://gerrit.wikimedia.org/r/438190 [08:27:08] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445093 (owner: 10Marostegui) [08:27:21] (03CR) 10Volans: "Small comment on the reimage library, looks good otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/444839 (owner: 10Giuseppe Lavagetto) [08:27:44] (03PS3) 10Ema: varnish: startup process for multiple VCL files [puppet] - 10https://gerrit.wikimedia.org/r/445081 (https://phabricator.wikimedia.org/T164609) [08:28:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445093 (owner: 10Marostegui) [08:28:39] (03CR) 10Ema: [C: 032] varnish: startup process for multiple VCL files [puppet] - 10https://gerrit.wikimedia.org/r/445081 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [08:28:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 after alter table (duration: 00m 56s) [08:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:53] (03PS4) 10Alexandros Kosiaris: grafana-admin: Redirect to to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/444622 [08:35:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana-admin: Redirect to to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/444622 (owner: 10Alexandros Kosiaris) [08:37:50] !log reboot ms-be1040 to run hardware diagnostics - T199198 [08:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:54] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:39:00] (03PS3) 10Alexandros Kosiaris: proton: Add discovery hiera [puppet] - 10https://gerrit.wikimedia.org/r/444225 (https://phabricator.wikimedia.org/T186748) [08:40:52] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:40:55] (03PS2) 10Alexandros Kosiaris: proton: Add proton.discovery.wmnet RR [dns] - 10https://gerrit.wikimedia.org/r/444226 (https://phabricator.wikimedia.org/T186748) [08:43:39] (03CR) 10Alexandros Kosiaris: "done" [dns] - 10https://gerrit.wikimedia.org/r/444226 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [08:43:47] (03CR) 10Alexandros Kosiaris: "done" [puppet] - 10https://gerrit.wikimedia.org/r/444225 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [08:48:46] (03PS2) 10Jcrespo: mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869 [08:50:21] (03PS3) 10Muehlenhoff: Stop using DSA/DSS host keys [puppet] - 10https://gerrit.wikimedia.org/r/438190 [08:50:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Seems 
overall correct, but the puppet compiler shows:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson) [08:51:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "sorry, forgot to add a link to the compiler run:" [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson) [08:52:29] (03CR) 10Muehlenhoff: [C: 032] Stop using DSA/DSS host keys [puppet] - 10https://gerrit.wikimedia.org/r/438190 (owner: 10Muehlenhoff) [08:55:56] (03CR) 10Alexandros Kosiaris: [C: 032] proton: Add discovery hiera [puppet] - 10https://gerrit.wikimedia.org/r/444225 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [08:56:14] (03PS4) 10Alexandros Kosiaris: proton: Add discovery hiera [puppet] - 10https://gerrit.wikimedia.org/r/444225 (https://phabricator.wikimedia.org/T186748) [08:56:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] proton: Add discovery hiera [puppet] - 10https://gerrit.wikimedia.org/r/444225 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [08:57:11] (03CR) 10Alexandros Kosiaris: [C: 032] proton: Add proton.discovery.wmnet RR [dns] - 10https://gerrit.wikimedia.org/r/444226 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [08:57:47] 10Operations, 10ops-eqiad, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832 (10jcrespo) BTW, I can still see on racktables a labsdb1002-array1- not sure if a mistake on the application or it really is still there on reality,... [09:01:29] 10Operations, 10ops-eqiad, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832 (10Marostegui) As per the steps completed above, looks like 1001 and 1003 are down but not unracked. [09:05:37] (03CR) 10Addshore: [C: 031] Enable FileExporter for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594) (owner: 10WMDE-Fisch) [09:06:33] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: clean labvirt1021 resources [dns] - 10https://gerrit.wikimedia.org/r/445100 (https://phabricator.wikimedia.org/T199107) [09:07:03] !log installing remaining php7 security updates for stretch [09:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:39] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: clean labvirt1021 resources [dns] - 10https://gerrit.wikimedia.org/r/445100 (https://phabricator.wikimedia.org/T199107) (owner: 10Arturo Borrero Gonzalez) [09:08:16] there is a pending patch merge in ns0.wikimedia.org [09:08:29] :? [09:09:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think I found the reason of the duplicate declaration in nginx." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441894 (owner: 10EBernhardson) [09:09:02] https://www.irccloud.com/pastebin/yBBNcDKO/ [09:09:26] ^ akosiaris [09:09:32] yup [09:11:32] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:12:05] yup I know [09:12:17] shall I merge it? 
[09:12:36] it won't work [09:12:41] so don't [09:12:45] spews out [09:12:55] # error: plugin_geoip: Invalid resource name 'disc-proton' detected from zonefile lookup [09:12:55] # error: Name 'proton.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-proton' [09:12:55] # fatal: rfc1035: Cannot load zonefile 'wmnet', failing [09:13:00] trying to figure it out [09:13:01] ok, the thing is I have a pending patch [09:13:13] ok thanks akosiaris let me know if I can be of any help [09:13:14] yeah the labvirt one [09:13:16] I saw [09:13:55] ah I think I see why [09:14:53] akosiaris: I'm around if needed ;) [09:15:00] (03PS10) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [09:15:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [09:15:31] (03CR) 10Addshore: [C: 04-1] Add monthly storage schema for graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [09:15:46] (03PS4) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837 [09:16:50] (03CR) 10Arturo Borrero Gonzalez: [C: 032] install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837 (owner: 10Arturo Borrero Gonzalez) [09:18:59] (03PS1) 10Muehlenhoff: Update Cumin alias for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/445103 [09:19:29] (03PS2) 10Muehlenhoff: Update Cumin alias for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/445103 [09:20:11] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - proton_24766: Servers proton2001.codfw.wmnet are marked down but pooled [09:20:12] (03PS1) 10Urbanecm: Use bewikibooks.png in wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445104 (https://phabricator.wikimedia.org/T189218) [09:20:20] expected ^ [09:20:24] it's proton being pooled [09:20:37] trying to figure out why the ProxyFetch check doesn't work [09:20:47] we have way too much intertwined puppet, dns, lvs configuration [09:21:11] I had to merge an LVS hieradata change to get DNS updates working again [09:21:37] arturo: ok fixed. 
FYI, ns0 is at the US, it's probably better for you that are EU based to use ns2 (ssh can be a bit faster) [09:21:41] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:21:43] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/445103 (owner: 10Muehlenhoff) [09:21:57] arturo: your change has been merged as well [09:22:42] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([proton2001.codfw.wmnet]) [09:22:51] thanks akosiaris [09:22:59] (03PS1) 10Elukey: statistics::user: set http[s] proxy in git config [puppet] - 10https://gerrit.wikimedia.org/r/445106 (https://phabricator.wikimedia.org/T198623) [09:24:54] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869 (owner: 10Jcrespo) [09:25:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall seems correct, but too many things done at once; in particular I think prometheus::define_config should be discussed in a separate" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441321 (owner: 10EBernhardson) [09:26:30] (03Merged) 10jenkins-bot: mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869 (owner: 10Jcrespo) [09:26:55] (03CR) 10Elukey: [C: 032] statistics::user: set http[s] proxy in git config [puppet] - 10https://gerrit.wikimedia.org/r/445106 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [09:27:01] (03PS2) 10Elukey: statistics::user: set http[s] proxy in git config [puppet] - 10https://gerrit.wikimedia.org/r/445106 (https://phabricator.wikimedia.org/T198623) [09:27:03] (03CR) 10Elukey: [V: 032 C: 032] statistics::user: set http[s] proxy in git config [puppet] - 10https://gerrit.wikimedia.org/r/445106 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [09:27:53] (03CR) 10jenkins-bot: mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869 (owner: 10Jcrespo) [09:28:46] (03PS1) 10Alexandros Kosiaris: proton: Fix the LVS checks to use /_info [puppet] - 10https://gerrit.wikimedia.org/r/445107 (https://phabricator.wikimedia.org/T186748) [09:29:20] (03CR) 10Alexandros Kosiaris: [C: 032] proton: Fix the LVS checks to use /_info [puppet] - 10https://gerrit.wikimedia.org/r/445107 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [09:30:24] !log ppchelko@deploy1001 Started deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009 [09:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:28] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [09:30:40] PROBLEM - Host proton.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [09:30:58] <_joe_> akosiaris: someone's calling ^^ [09:31:07] ocg [09:31:21] <_joe_> akosiaris: that's you, right? [09:31:26] PROBLEM - HHVM rendering on mw2145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:29] is it maintenance or something else? 
[09:31:48] new service [09:31:52] ok [09:31:56] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [09:32:06] finally pybal is happy [09:32:14] <_joe_> jynus: it pages the first day as to set your expectations [09:32:21] :) [09:32:25] RECOVERY - HHVM rendering on mw2145 is OK: HTTP OK: HTTP/1.1 200 OK - 76458 bytes in 0.297 second response time [09:32:34] lol [09:32:37] hahahahaa [09:32:46] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [09:33:18] :-) [09:33:21] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 00m 57s) [09:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:44] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009 (duration: 04m 20s) [09:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:22] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 41 connections established with conf1001.eqiad.wmnet:2379 (min=42) [09:36:51] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/444909 (owner: 10Giuseppe Lavagetto) [09:37:16] (03PS1) 10Vgutierrez: site: reimage baham as spare server [puppet] - 10https://gerrit.wikimedia.org/r/445109 (https://phabricator.wikimedia.org/T199247) [09:38:40] !log ppchelko@deploy1001 Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009 [09:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:43] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [09:41:53] !log ppchelko@deploy1001 deploy aborted: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009 (duration: 03m 13s) [09:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:24] (03PS1) 10Elukey: Add profile::analytics::cluster::gitconfig to stat and notebooks [puppet] - 10https://gerrit.wikimedia.org/r/445111 (https://phabricator.wikimedia.org/T198623) [09:43:06] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::cluster::gitconfig to stat and notebooks [puppet] - 10https://gerrit.wikimedia.org/r/445111 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [09:43:40] RECOVERY - Host proton.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [09:44:28] !log ppchelko@deploy1001 Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. 
T169009 [09:44:29] (03CR) 10Volans: [C: 031] "Compiler looks happy, LGTM:" [puppet] - 10https://gerrit.wikimedia.org/r/444908 (owner: 10Giuseppe Lavagetto) [09:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:31] (03PS6) 10Mark Bergsma: Clarify interface of buildProtocol and setEnabledAddressFamilies [debs/pybal] - 10https://gerrit.wikimedia.org/r/433736 [09:44:31] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [09:44:33] (03PS5) 10Mark Bergsma: Fix BGP collision detection [debs/pybal] - 10https://gerrit.wikimedia.org/r/434161 [09:44:35] (03PS6) 10Mark Bergsma: Add tests that emulate client or server sessions initial connection [debs/pybal] - 10https://gerrit.wikimedia.org/r/434162 [09:44:37] (03PS5) 10Mark Bergsma: Move FSM connect state handling to the FSM itself [debs/pybal] - 10https://gerrit.wikimedia.org/r/434163 [09:44:39] (03PS3) 10Mark Bergsma: Implement BGP FSM events 4 and 5 (passive start) [debs/pybal] - 10https://gerrit.wikimedia.org/r/436297 [09:44:41] (03PS3) 10Mark Bergsma: Implement BGP FSM event 14 [debs/pybal] - 10https://gerrit.wikimedia.org/r/436298 [09:44:43] (03PS3) 10Mark Bergsma: Correct incoming connection interaction with BGP FSM [debs/pybal] - 10https://gerrit.wikimedia.org/r/436299 [09:45:12] * vgutierrez hides [09:45:16] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [09:45:30] :) [09:45:30] (03PS2) 10Elukey: Add profile::analytics::cluster::gitconfig to stat and notebooks [puppet] - 10https://gerrit.wikimedia.org/r/445111 (https://phabricator.wikimedia.org/T198623) [09:46:01] (03CR) 10Mark Bergsma: [C: 032] Clarify interface of buildProtocol and setEnabledAddressFamilies [debs/pybal] - 10https://gerrit.wikimedia.org/r/433736 (owner: 10Mark Bergsma) [09:46:11] (03PS5) 10Jonas Kress (WMDE): Add monthly storage schema for graphite [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) [09:46:42] (03Merged) 10jenkins-bot: Clarify interface of buildProtocol and setEnabledAddressFamilies [debs/pybal] - 10https://gerrit.wikimedia.org/r/433736 (owner: 10Mark Bergsma) [09:46:58] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/11773/" [puppet] - 10https://gerrit.wikimedia.org/r/445111 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [09:48:02] (03CR) 10Mark Bergsma: [C: 032] Fix BGP collision detection [debs/pybal] - 10https://gerrit.wikimedia.org/r/434161 (owner: 10Mark Bergsma) [09:48:06] (03PS1) 10Arturo Borrero Gonzalez: install_server: rename labvirt1022 to cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/445114 (https://phabricator.wikimedia.org/T199202) [09:48:50] (03Merged) 10jenkins-bot: Fix BGP collision detection [debs/pybal] - 10https://gerrit.wikimedia.org/r/434161 (owner: 10Mark Bergsma) [09:51:26] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [09:51:34] (03CR) 10Mark Bergsma: [C: 032] Add tests that emulate client or server sessions initial connection [debs/pybal] - 10https://gerrit.wikimedia.org/r/434162 (owner: 10Mark Bergsma) [09:52:10] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: rename labvirt1022.eqiad.wmnet to cloudvirt1022.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/445115 (https://phabricator.wikimedia.org/T199202) [09:52:12] (03Merged) 10jenkins-bot: Add tests that emulate client or server sessions initial connection [debs/pybal] - 
10https://gerrit.wikimedia.org/r/434162 (owner: 10Mark Bergsma) [09:52:16] (03CR) 10Arturo Borrero Gonzalez: [C: 032] install_server: rename labvirt1022 to cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/445114 (https://phabricator.wikimedia.org/T199202) (owner: 10Arturo Borrero Gonzalez) [09:52:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: rename labvirt1022.eqiad.wmnet to cloudvirt1022.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/445115 (https://phabricator.wikimedia.org/T199202) (owner: 10Arturo Borrero Gonzalez) [09:55:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. T169009 (duration: 11m 14s) [09:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:45] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [09:56:58] (03CR) 10Volans: [C: 04-1] "Few comments to avoid to hardcode so many things." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [10:01:42] (03PS1) 10Jcrespo: mariadb: Fully depool db1086 (last depool didn't depool api) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445116 [10:01:58] (03PS2) 10Jcrespo: mariadb: Fully depool db1086 (last depool didn't depool api) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445116 [10:03:46] (03CR) 10Vgutierrez: [C: 032] site: reimage baham as spare server [puppet] - 10https://gerrit.wikimedia.org/r/445109 (https://phabricator.wikimedia.org/T199247) (owner: 10Vgutierrez) [10:03:54] (03PS2) 10Vgutierrez: site: reimage baham as spare server [puppet] - 10https://gerrit.wikimedia.org/r/445109 (https://phabricator.wikimedia.org/T199247) [10:07:20] heads up, I'm taking graphite200[12] out of service shortly and thus restarting carbon-c-relay, there will be a minute gap in graphite metrics [10:08:08] !log reimage baham as spare system - T199247 [10:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:12] T199247: Decommission baham - https://phabricator.wikimedia.org/T199247 [10:08:49] (03PS3) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 [10:08:51] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: create profile [puppet] - 10https://gerrit.wikimedia.org/r/444908 [10:08:53] (03PS2) 10Giuseppe Lavagetto: apache-fast-test: read files from the tests directory as a fallback [puppet] - 10https://gerrit.wikimedia.org/r/444909 [10:08:55] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 [10:08:59] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [10:09:32] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/includes/libs/rdbms/ChronologyProtector.php: 552dbdbef6faaa7e91d2e5fe027ad698ae96e544 (duration: 00m 59s) [10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:59] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [10:10:10] ^^ that's me [10:10:18] (03PS3) 10Filippo Giunchedi: graphite: take graphite200[12] out of service [puppet] - 10https://gerrit.wikimedia.org/r/444217 (https://phabricator.wikimedia.org/T196483) [10:11:31] !log 
aaron@deploy1001 Synchronized php-1.32.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php: ee89660a174364f71ddc65a45086efc3c786fb7a (duration: 00m 57s) [10:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:46] (03PS4) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 [10:12:23] (03CR) 10Filippo Giunchedi: [C: 032] graphite: take graphite200[12] out of service [puppet] - 10https://gerrit.wikimedia.org/r/444217 (https://phabricator.wikimedia.org/T196483) (owner: 10Filippo Giunchedi) [10:12:44] (03Abandoned) 10Muehlenhoff: switch mw-maintenance server from terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/431039 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [10:13:09] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational [10:14:21] (03CR) 10Jcrespo: [C: 032] mariadb: Fully depool db1086 (last depool didn't depool api) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445116 (owner: 10Jcrespo) [10:16:02] (03Merged) 10jenkins-bot: mariadb: Fully depool db1086 (last depool didn't depool api) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445116 (owner: 10Jcrespo) [10:17:23] (03Abandoned) 10Muehlenhoff: switch mw_maintenance server to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/441346 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [10:17:55] (03PS5) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 [10:17:58] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational [10:18:08] (03CR) 10jenkins-bot: mariadb: Fully depool db1086 (last depool didn't depool api) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445116 (owner: 10Jcrespo) [10:18:57] (03CR) 10Giuseppe Lavagetto: [C: 032] apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 (owner: 10Giuseppe Lavagetto) [10:19:13] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: create profile [puppet] - 10https://gerrit.wikimedia.org/r/444908 [10:19:35] (03CR) 10Elukey: "Looks very promising to me. 
I asked to Joe the possibility to rollout the change incrementally via batch/sleep rather than a simple puppet" [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [10:20:00] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::web_testing: create profile [puppet] - 10https://gerrit.wikimedia.org/r/444908 (owner: 10Giuseppe Lavagetto) [10:20:19] (03PS3) 10Giuseppe Lavagetto: apache-fast-test: read files from the tests directory as a fallback [puppet] - 10https://gerrit.wikimedia.org/r/444909 [10:22:01] (03CR) 10Giuseppe Lavagetto: [C: 032] apache-fast-test: read files from the tests directory as a fallback [puppet] - 10https://gerrit.wikimedia.org/r/444909 (owner: 10Giuseppe Lavagetto) [10:22:03] (03Abandoned) 10Muehlenhoff: mw_maintenace: remove temp change for wikidata crons [puppet] - 10https://gerrit.wikimedia.org/r/441381 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [10:23:31] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully depool db1086 (duration: 00m 56s) [10:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:07] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10jcrespo) I think the proposed plan ha deep architecture problems at storage layer, so we should discuss in dep... [10:34:08] (03PS1) 10Muehlenhoff: Remove terbium from allowed hosts/ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/445118 (https://phabricator.wikimedia.org/T192092) [10:35:49] !log installing tiff security updates on jessie hosts [10:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] (03CR) 10Volans: profile::mediawiki::web_testing: add script for deploying apache changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [10:45:11] !log disabled puppet in install1002 for a debian installer live hack (T199202) [10:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:15] T199202: labvirt1022: reallocate from eqiad1 as cloudvirt1022 - https://phabricator.wikimedia.org/T199202 [10:46:58] (03CR) 10Muehlenhoff: [C: 032] Remove terbium from allowed hosts/ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/445118 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [10:54:53] (03PS1) 10Ema: network::constants: define and use cache_text [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) [10:55:47] (03PS1) 10Muehlenhoff: Update allowed hosts for tcpircbot (terbium -> mwmaint1001) [puppet] - 10https://gerrit.wikimedia.org/r/445127 [10:57:44] (03CR) 10Ema: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler02/11775/" [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [10:58:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. 
T169009 [10:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [10:58:09] (03CR) 10Muehlenhoff: [C: 032] Update allowed hosts for tcpircbot (terbium -> mwmaint1001) [puppet] - 10https://gerrit.wikimedia.org/r/445127 (owner: 10Muehlenhoff) [10:58:57] (03PS1) 10Gergő Tisza: Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 [10:59:29] (03PS2) 10Gergő Tisza: Do not set deprecated value for $wgExternalDiffEngine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 [11:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1100). [11:00:05] CFisch_WMDE, raynor, Pchelolo, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:15] here [11:00:23] o/ [11:00:50] \o/ [11:00:51] o/, I'm here and I can deploy my changes by myself, I can go last as I have two config changes + 3 cleanups [11:00:59] 5 in total [11:01:16] config cleanup are noop [11:01:52] 0/ [11:02:03] (03PS1) 10Jcrespo: mariadb: Disable notifications for db1086 [puppet] - 10https://gerrit.wikimedia.org/r/445129 [11:02:13] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 63.66, 41.48, 28.16 [11:02:27] I can SWAT today [11:02:37] CFisch_WMDE: you are not a deployer, correct? [11:02:37] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Disable notifications for db1086 [puppet] - 10https://gerrit.wikimedia.org/r/445129 (owner: 10Jcrespo) [11:02:49] zeljkof: Still not, no [11:03:11] !log restart HHVM on mw1282 [11:03:11] Pchelolo: should I deploy your commit? [11:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:29] Amir1: go ahead with your change while I get ready :) [11:03:36] cool [11:03:40] zeljkof: as always, better you :) mine can go last, it's a code change, takes time [11:03:57] raynor: if you are willing to go last, stand by :) [11:04:08] :) [11:04:15] I'll wait [11:04:26] Pchelolo: I'll review and merge it right now, it takes time to merge, and can be deployed in parallel with config changes [11:04:31] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444555 (https://phabricator.wikimedia.org/T194165) (owner: 10Ladsgroup) [11:04:33] kk [11:05:10] !log stop db1086 for reimage [11:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:15] raynor: please note that 6 patches is the average, and there are 8 for this window, so not all of them might fit, but you can stay longer and finish [11:05:39] train is in two hours, and I'm the conductor (driver?) [11:06:07] I have two patches that require a proper testing, and 3 patches are the config cleanups [11:06:13] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. 
T169009 (duration: 08m 07s) [11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:16] T169009: Cassandra Node.JS driver v3.2.2 issues - https://phabricator.wikimedia.org/T169009 [11:06:26] !log ppchelko@deploy1001 Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. T169009 [11:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:29] (03PS1) 10Arturo Borrero Gonzalez: Revert "install_server: partman: refresh labvirt-ssd partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/445130 [11:06:49] (03PS2) 10Ladsgroup: Write to change_tag_def and the new column in change_tag in Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444555 (https://phabricator.wikimedia.org/T194165) [11:07:12] (03CR) 10Ladsgroup: [C: 032] Write to change_tag_def and the new column in change_tag in Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444555 (https://phabricator.wikimedia.org/T194165) (owner: 10Ladsgroup) [11:07:14] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "install_server: partman: refresh labvirt-ssd partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/445130 (owner: 10Arturo Borrero Gonzalez) [11:08:14] (03PS2) 10Jcrespo: mariadb: Disable notifications for db1086 [puppet] - 10https://gerrit.wikimedia.org/r/445129 [11:08:18] !log re-enable puppet in install1002 [11:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:30] (03Merged) 10jenkins-bot: Write to change_tag_def and the new column in change_tag in Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444555 (https://phabricator.wikimedia.org/T194165) (owner: 10Ladsgroup) [11:08:44] (03CR) 10jenkins-bot: Write to change_tag_def and the new column in change_tag in Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444555 (https://phabricator.wikimedia.org/T194165) (owner: 10Ladsgroup) [11:09:13] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications for db1086 [puppet] - 10https://gerrit.wikimedia.org/r/445129 (owner: 10Jcrespo) [11:09:22] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. 
T169009 (duration: 02m 56s) [11:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:25] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 19.79, 31.16, 29.69 [11:09:47] (03PS1) 10Ema: cache_canary: add config-master testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/445133 (https://phabricator.wikimedia.org/T164609) [11:10:21] (03PS2) 10Ema: cache_canary: add config-master for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/445133 (https://phabricator.wikimedia.org/T164609) [11:10:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444555|Set $wgChangeTagsSchemaMigrationStage to write both for Wikisource (T194165)]] (duration: 00m 58s) [11:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:00] T194165: Start writing to change_tag_def in production - https://phabricator.wikimedia.org/T194165 [11:11:05] (03CR) 10Ema: [C: 032] cache_canary: add config-master for testing purposes [puppet] - 10https://gerrit.wikimedia.org/r/445133 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [11:11:22] CFisch_WMDE: please stand by, you are next, after Amir1 is done [11:11:34] zeljkof: Alright! [11:11:35] zeljkof: mine is just done now [11:12:06] !log start of ladsgroup@mwmaint1001:~$ foreachwikiindblist wikisource populateChangeTagDef.php [11:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:17] Amir1: great! taking over swat then [11:12:48] (03PS2) 10Zfilipin: Enable FileExporter for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594) (owner: 10WMDE-Fisch) [11:12:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594) (owner: 10WMDE-Fisch) [11:14:07] 10Operations: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10akosiaris) >>! In T198939#4408846, @Volans wrote: >>>! In T198939#4408774, @akosiaris wrote: >> There is a query tab but I get >> >> ``` >> What you were looking for has been disabled by the administrator. >> ``` > > We decide... [11:14:32] (03Merged) 10jenkins-bot: Enable FileExporter for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594) (owner: 10WMDE-Fisch) [11:15:44] PROBLEM - Disk space on labsdb1006 is CRITICAL: DISK CRITICAL - free space: / 662 MB (1% inode=98%) [11:16:01] CFisch_WMDE: 444901 is at mwdebug1002, please test and let me know if I can deploy it [11:16:20] thanks I will do [11:16:42] CFisch_WMDE: sorry, just noticed, forgot something, not there yet, in a minute [11:17:07] ha and I wondered ^^ [11:17:36] it's there now [11:17:46] sorry, head full of train this week :) [11:17:54] PROBLEM - Disk space on labsdb1006 is CRITICAL: DISK CRITICAL - free space: / 1653 MB (3% inode=98%) [11:18:04] (03CR) 10jenkins-bot: Enable FileExporter for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594) (owner: 10WMDE-Fisch) [11:18:28] CFisch_WMDE: it's at mwdebug now [11:18:54] RECOVERY - Disk space on labsdb1006 is OK: DISK OK [11:19:18] I will look [11:19:19] Pchelolo: please stand by, your commit is merged, you are after CFisch_WMDE [11:19:38] ok ok [11:19:46] zeljkof: 1002 you said right? 
[11:20:05] CFisch_WMDE: yes, it's always mwdebug1002 [11:20:11] hmmm I don't see the beta feature in the list ... [11:20:24] might be something with the new preferences I wonder [11:20:37] let me double check if I did everything right... [11:20:43] but I think I did this time [11:21:26] hmm should be fine to roll out anyway ... [11:22:39] CFisch_WMDE: all good on my side, ok, deploying then [11:22:54] yepp thanks [11:23:52] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444901|Enable FileExporter for sourceswiki (T198594)]] (duration: 00m 56s) [11:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:56] T198594: request to enable Move Files to Commons feature in sourceswiki - https://phabricator.wikimedia.org/T198594 [11:24:03] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 [11:24:11] <_joe_> volans, elukey ^^ [11:24:15] CFisch_WMDE deployed, please check and thanks for deploying with #releng :) [11:24:19] <_joe_> I won't spend more time on this bash version [11:24:31] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [11:24:43] hallo [11:24:43] what happened, it doubled in size in the last half an hour :D [11:24:46] <_joe_> and the better version will only come once we have the python automation [11:24:51] + [11:24:53] +1 [11:25:11] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 [11:25:17] <_joe_> poor jenkins was confused [11:26:49] On terbium I periodically ran queries on the wikishared database, using `sql wikishared`. Now on mwmaint1001, if I try to run this command, I get: "Error: unable to get reader index". Not sure if it's because of the change of the sql command itself or because of the migration to mwmaint1001. [11:27:13] `sql enwiki` works as expected, but not `sql wikishared`. [11:28:01] (03CR) 10Gehel: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (owner: 10EBernhardson) [11:28:40] (03PS6) 10Gehel: Enable kafka poller on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/444265 (owner: 10Smalyshev) [11:29:19] zeljkof - FYI pretty big storm is coming here, I'm on laptop and my network is backed up by UPS + LTE connection. 
I'll be around but I might drop out for a minute or two [11:29:29] raynor: ok, good luck [11:29:52] Pchelolo: your commit is at mwdebug1002, please test and let me know if I can deploy it [11:30:08] zeljkof: kk, I need 5 mins to test it [11:30:11] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [11:30:16] Pchelolo: ok [11:30:28] raynor: I will be done in about 5 minutes [11:30:45] (03CR) 10Gehel: [C: 032] "puppet compiler looks good" [puppet] - 10https://gerrit.wikimedia.org/r/444265 (owner: 10Smalyshev) [11:32:34] ok [11:36:17] SMalyshev: ^ wdqs10(09|10) are now using kafka poller, data is flowing nicely, but wdqs1009 reports negative polling progress [11:36:45] zeljkof: ok, it's not making it worse, so it's fine to proceed [11:36:56] Pchelolo: ok, deploying [11:36:58] :) [11:38:06] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/EventBus/: SWAT: [[gerrit:445117|Properly handle null content format]] (duration: 00m 59s) [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:24] 10Operations, 10DBA, 10Scap: "sql wikishared" doesn't work on mwmaint1001 - https://phabricator.wikimedia.org/T199316 (10Amire80) [11:38:45] Pchelolo: deployed, please check and thanks for deploying with #releng ;) [11:38:52] raynor: the swat is yours! [11:38:53] thank you zeljkof [11:39:00] Good luck for Croatia today :) [11:39:15] ok, thank you [11:39:26] 10Operations, 10Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (10chasemp) >>! In T198138#4414875, @MoritzMuehlenhoff wrote: > Is there a technical reason to only do it selectively for a few WMCS hosts and not in general? Or in other words, is there any known... [11:39:37] Pchelolo: thanks! :D [11:39:43] (03PS2) 10Pmiazga: Enable page previews for all new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) [11:40:14] (03CR) 10Pmiazga: [C: 032] Enable page previews for all new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga) [11:40:18] SMalyshev: you probably ran the tests on wdqs1009, so it is using the time of the last test as a starting point. And thus a bit backed up. I'm letting it catch up on its own (that's actually a nice test). [11:41:52] (03Merged) 10jenkins-bot: Enable page previews for all new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga) [11:42:08] (03CR) 10jenkins-bot: Enable page previews for all new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga) [11:42:57] 10Operations: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Volans) Ack @akosiaris, I've opened https://github.com/voxpupuli/puppetboard/issues/475 for now, I'll see if I can find the time to send a patch, doesn't seem overcomplicated to just filter the queryable endpoints. 
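For readers following the SWAT flow above: a change staged on the debug host (mwdebug1002, as noted earlier in this window) can be spot-checked before the full sync by pinning a request to that backend with the X-Wikimedia-Debug header. A minimal sketch, assuming the backend=... header syntax and using Special:Version as the probe (neither the header value format nor the probe URL is taken from this log):

    # Route one request through mwdebug1002 instead of the normal app servers,
    # then confirm the staged branch is what responds.
    curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://en.wikipedia.org/wiki/Special:Version' | grep -o 'wmf\.[0-9]*' | head -1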
[11:47:40] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: clean labvirt1022 resources [dns] - 10https://gerrit.wikimedia.org/r/445141 (https://phabricator.wikimedia.org/T199202) [11:49:39] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: clean labvirt1022 resources [dns] - 10https://gerrit.wikimedia.org/r/445141 (https://phabricator.wikimedia.org/T199202) (owner: 10Arturo Borrero Gonzalez) [11:51:23] (03CR) 10Daniel Kinzler: [C: 031] "Agree with intent. No idea if this may blow up for some arcane reason." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445128 (owner: 10Gergő Tisza) [11:53:11] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444906|Enable page previews for all new editors (T197719)]] (duration: 00m 56s) [11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:14] T197719: Deploy new "PopupsOptInStateForNewAccounts" page previews configuration to all interested projects - https://phabricator.wikimedia.org/T197719 [11:55:25] PROBLEM - Nginx local proxy to apache on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:15] (03PS2) 10Pmiazga: Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036 (owner: 10Jdlrobson) [11:56:25] RECOVERY - Nginx local proxy to apache on mw2262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.191 second response time [11:56:28] (03CR) 10Pmiazga: [C: 032] Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036 (owner: 10Jdlrobson) [11:56:51] zeljkof - I know there is a no-deps window, can I proceed with my patches? [11:57:12] I'm finishing the second patch (that I have to test), and then 3 noop patches [11:57:26] that I also have to test but those should be pretty quick one [11:58:06] (03Merged) 10jenkins-bot: Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036 (owner: 10Jdlrobson) [11:58:21] (03CR) 10jenkins-bot: Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036 (owner: 10Jdlrobson) [11:59:52] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1086 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/445142 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1200) [12:01:52] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1086 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/445142 [12:03:13] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db1086 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/445142 (owner: 10Jcrespo) [12:05:41] (03PS2) 10Pmiazga: Hygiene: Remove unsued VectorExperimentalPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444586 [12:05:51] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445036|Remove ambox images on mobile wikis (T191303)]] (duration: 00m 57s) [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:54] T191303: Mobile page issues - visual styling changes - https://phabricator.wikimedia.org/T191303 [12:07:14] (03CR) 10Pmiazga: [C: 032] Hygiene: Remove unsued VectorExperimentalPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444586 (owner: 10Pmiazga) [12:08:44] (03Merged) 10jenkins-bot: Hygiene: Remove unsued VectorExperimentalPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444586 (owner: 10Pmiazga) [12:08:51] (03PS1) 10Jcrespo: mariadb: Repool 
db1086 after reimage with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445144 [12:08:56] (03CR) 10jenkins-bot: Hygiene: Remove unsued VectorExperimentalPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444586 (owner: 10Pmiazga) [12:10:41] (03PS2) 10Pmiazga: Hygiene: remove unused MinervaDownloadIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444592 [12:11:28] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444586|Cleanup: Remove unsued VectorExperimentalPrintStyles]] (duration: 00m 56s) [12:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:48] (03CR) 10Pmiazga: "nope, https is no forced by default (using core), the MFForceSecureLogin is not used at all." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444591 (owner: 10Pmiazga) [12:13:10] jouncebot: now [12:13:10] For the next 0 hour(s) and 46 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1200) [12:13:13] jouncebot: next [12:13:13] In 0 hour(s) and 46 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1300) [12:13:42] I'm almost done [12:13:47] waiting for CI ;( [12:13:55] raynor: sorry, just saw your ping, sure, go ahead :) [12:14:39] (03CR) 10Pmiazga: [C: 032] Hygiene: remove unused MinervaDownloadIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444592 (owner: 10Pmiazga) [12:16:11] (03Merged) 10jenkins-bot: Hygiene: remove unused MinervaDownloadIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444592 (owner: 10Pmiazga) [12:17:11] (03PS1) 10Filippo Giunchedi: phabricator: bump request rate_limits [puppet] - 10https://gerrit.wikimedia.org/r/445145 [12:18:15] (03CR) 10jenkins-bot: Hygiene: remove unused MinervaDownloadIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444592 (owner: 10Pmiazga) [12:19:22] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444592|Cleanup: Remove unused MinervaDownloadIcon]] (duration: 00m 57s) [12:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:27] (03CR) 10Pmiazga: [C: 032] Hygiene: remove unsued MFForceSecureLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444591 (owner: 10Pmiazga) [12:21:00] (03Merged) 10jenkins-bot: Hygiene: remove unsued MFForceSecureLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444591 (owner: 10Pmiazga) [12:21:08] (03CR) 10Reedy: Remove pear packages from MW Application Servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [12:22:45] (03CR) 10jenkins-bot: Hygiene: remove unsued MFForceSecureLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444591 (owner: 10Pmiazga) [12:23:38] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444591|Cleanup: Remove unsued MFForceSecureLogin]] (duration: 00m 56s) [12:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:29] !log EU SWAT finished [12:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:45] ok, I'm done, Reedy, zeljkof sorry it took so long [12:24:55] raynor: no problemo [12:25:09] waiting for CI really builds up the pressure, you just wait, and wait, and wait [12:25:12] there is time for logs to cool down before the train :) [12:25:23] btw, zeljkof could 
you check the logs, the fatalmonitor [12:25:29] hasharAway is working on CI, should be better soon [12:25:32] Hmm. Do I deploy a couple of changes now... [12:25:33] I'm wondering, we have 24 errors of nothing [12:26:16] I'm just wondering whats that [12:27:42] raynor: I don't think it's important, unless it's triple digit I ignore them :) [12:27:54] ;) [12:28:00] noted, thanks [12:28:29] ok, I'm around, I just need another cup of coffee [12:35:14] (03PS3) 10Reedy: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 [12:35:21] (03CR) 10Reedy: [C: 032] Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy) [12:36:37] (03Merged) 10jenkins-bot: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy) [12:38:10] (03CR) 10jenkins-bot: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy) [12:40:28] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: ContactPage new config (duration: 00m 57s) [12:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:38] !log reedy@deploy1001 Synchronized wmf-config/MetaContactPages.php: ContactPage new config (duration: 00m 56s) [12:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] (03PS1) 10Muehlenhoff: Reimage wasat with stretch and rename to mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/445149 [12:43:10] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/ContactPage: New config! (duration: 00m 57s) [12:43:15] 10Operations, 10Research, 10SRE-Access-Requests: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (10DarTar) @RobH this is approved on my end. @Miriam is the official point of contact on this collaboration. @RStallman-legalteam: the spreadsheet doesn't mention t... [12:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:58] Hmm. That technically needs a full scap [12:44:00] Bah to that [12:44:20] 10Operations, 10Research, 10SRE-Access-Requests: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (10DarTar) My bad, Michele has a valid NDA in the sheet, see column I. [12:51:25] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) @ayounsi could you enable lvs1015 network ports? thanks! [12:51:39] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10mark) We had a long and interesting discussion about this on [[ http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-traffic/20180711.txt | IRC ]]... 
[12:54:33] (03PS1) 10Muehlenhoff: Add Michele Catasta to users [puppet] - 10https://gerrit.wikimedia.org/r/445153 (https://phabricator.wikimedia.org/T198662) [12:54:35] (03PS1) 10Muehlenhoff: Add pirroh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/445154 (https://phabricator.wikimedia.org/T198662) [12:55:10] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (10MoritzMuehlenhoff) [12:57:12] 10Operations, 10Traffic, 10UniversalLanguageSelector: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270 (10Petar.petkovic) [12:57:18] (03PS2) 10Muehlenhoff: Add Michele Catasta to users [puppet] - 10https://gerrit.wikimedia.org/r/445153 (https://phabricator.wikimedia.org/T198662) [12:57:46] (03PS5) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 [12:58:42] <_joe_> what's up with jenkins? [12:58:52] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910 (owner: 10Giuseppe Lavagetto) [12:59:47] !log Remove unused grants from db1098:3316 [12:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] zeljkof: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1300). [13:00:32] jouncebot: are you reading my mind!?!1!1! :P [13:03:27] thcipriani: I got the scap update ready, will wait for the train to finish [13:04:27] boring [13:05:04] godog: awesome, thanks, sounds good! [13:05:58] Reedy: in your front yard? [13:08:13] (03PS4) 10Giuseppe Lavagetto: mediawiki_exp: unify the small private wikis definitions [puppet] - 10https://gerrit.wikimedia.org/r/444185 [13:08:35] (03PS1) 10Vgutierrez: site: add lvs1015 as spare system [puppet] - 10https://gerrit.wikimedia.org/r/445162 (https://phabricator.wikimedia.org/T184293) [13:09:25] (03PS1) 10Elukey: role::aqs: deploy new Druid config [puppet] - 10https://gerrit.wikimedia.org/r/445163 (https://phabricator.wikimedia.org/T199299) [13:09:33] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [13:09:58] (03CR) 10Alexandros Kosiaris: [C: 031] "I guess +1" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [13:11:33] zeljkof: are you operating the train today? please LMK when done [13:11:47] godog: I am, sure [13:11:53] (03CR) 10Joal: [C: 031] "Thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/445163 (https://phabricator.wikimedia.org/T199299) (owner: 10Elukey) [13:11:55] thanks! 
[13:12:05] (03CR) 10Elukey: [C: 032] role::aqs: deploy new Druid config [puppet] - 10https://gerrit.wikimedia.org/r/445163 (https://phabricator.wikimedia.org/T199299) (owner: 10Elukey) [13:12:29] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki_exp: unify the small private wikis definitions [puppet] - 10https://gerrit.wikimedia.org/r/444185 (owner: 10Giuseppe Lavagetto) [13:12:42] (03PS5) 10Giuseppe Lavagetto: mediawiki_exp: unify the small private wikis definitions [puppet] - 10https://gerrit.wikimedia.org/r/444185 [13:12:43] <_joe_> grrr [13:12:49] <_joe_> you merge-sniped me luca [13:13:02] ahahahah [13:13:44] !log upgrade apertium-fra-cat on scb boxes to 1.3.0~r84327-1+wmf1, T189076 [13:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] T189076: Update apertium-fra-cat MT pair - https://phabricator.wikimedia.org/T189076 [13:14:59] !log roll restart of aqs on aqs* to pick up the new Druid config [13:15:01] (03PS1) 10Zfilipin: group1 wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445165 [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:03] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445165 (owner: 10Zfilipin) [13:16:01] !log restart apertium-apy for apertium-fra-cat upgrade on scb boxes to 1.3.0~r84327-1+wmf1, T189076 [13:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:11] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445165 (owner: 10Zfilipin) [13:18:21] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445165 (owner: 10Zfilipin) [13:19:32] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10mark) >>! In T184293#4415691, @Vgutierrez wrote: > @ayounsi could you enable lvs1015 network ports? thanks! I added lvs1015 to interface-range LVS-balancer on asw2-c-eqia... [13:20:24] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) The thread on `linux-xfs` starts here: https://www.spinics.net/lists/linux-xfs/msg20592.html and the wrong free blocks count looks like it is due t... [13:21:23] (03CR) 10Alexandros Kosiaris: [C: 031] services: Define dc-pairs of the same service together [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443872 (owner: 10Krinkle) [13:24:19] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.12 [13:24:23] PROBLEM - Host ms-be1040 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:17] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.12 (duration: 00m 57s) [13:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:29] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS diskspace is low - https://phabricator.wikimedia.org/T196485 (10faidon) [13:26:45] ms-be1040 is me btw, or T199198 rather [13:26:46] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [13:27:34] hashar, thcipriani: looks like I'm done with train today? can it be that easy?! 
:) [13:28:16] godog: I think I'm done with train for today, this is my first ever week of train, so am not sure if there are more things to do :) [13:28:26] if all the logs looks clean, then yeah: it can be that easy :) [13:28:26] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki_exp: unify the small private wikis definitions" [puppet] - 10https://gerrit.wikimedia.org/r/445166 [13:28:33] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert "mediawiki_exp: unify the small private wikis definitions" [puppet] - 10https://gerrit.wikimedia.org/r/445166 (owner: 10Giuseppe Lavagetto) [13:28:34] hashar, thcipriani: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Wednesday:_group0_to_group1_deploy [13:28:58] I did a bit of doc refresh, but I get `fatal: Not a git repository (or any of the parent directories): .git` [13:29:23] should I run the script from `release/bin/deploy-promote`? [13:29:27] instead of from home? [13:29:53] hrm, yeah, inside "release" is where I always run it, this is a bug though [13:30:07] at the start of the script it checks to make sure it's the latest version of the script [13:30:13] also, group1 version did not change here https://tools.wmflabs.org/versions/ [13:30:53] yesyet [13:31:04] thcipriani: should I update the docs, or will you update the script? [13:31:06] https://en.wikibooks.org/wiki/Special:Version is the right version [13:31:14] I'll update the script [13:31:28] great, thanks, I'll remove the error message from the docs :) [13:32:10] zeljkof: ack, thanks! [13:32:40] thcipriani: I'll update scap [13:32:51] okie doke [13:33:45] (03CR) 10Vgutierrez: [C: 032] site: add lvs1015 as spare system [puppet] - 10https://gerrit.wikimedia.org/r/445162 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [13:33:45] (03PS2) 10Filippo Giunchedi: Scap: Bump version to 3.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/445031 (https://phabricator.wikimedia.org/T199283) (owner: 10Thcipriani) [13:33:49] (03PS2) 10Vgutierrez: site: add lvs1015 as spare system [puppet] - 10https://gerrit.wikimedia.org/r/445162 (https://phabricator.wikimedia.org/T184293) [13:34:04] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/445031 (https://phabricator.wikimedia.org/T199283) (owner: 10Thcipriani) [13:35:05] (03PS3) 10Filippo Giunchedi: Scap: Bump version to 3.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/445031 (https://phabricator.wikimedia.org/T199283) (owner: 10Thcipriani) [13:36:58] thcipriani: is there a time I need to keep an eye on the logs? minutes, hours, all day? :) [13:38:03] ah, found "if there is an unexplained error that occurs within 1 hour of a train deployment — always roll back the train" [13:38:10] so I guess about an hour [13:38:22] thcipriani: deploy1001 updated [13:38:39] (03CR) 10Vgutierrez: [C: 04-1] [WIP] get rid of openssl CLI usage (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [13:38:48] zeljkof: no set time, no. 
I'll generally keep a close watch for...some amount of time...an hour, sure, then I'll periodically check this channel and the logs to make sure nothing new has popped up [13:39:01] godog: cool, I'll run a test noop sync [13:39:23] (03PS5) 10Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 [13:40:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [13:40:43] can I do then some unrelated db pools/depools ? [13:41:20] (03PS5) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) [13:41:22] (03PS2) 10Jcrespo: mariadb: Repool db1086 after reimage with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445144 [13:41:27] !log thcipriani@deploy1001 Synchronized README: noop: test scap 3.8.4-1 (duration: 00m 56s) [13:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:18] godog: canary checks still seem to work, so I didn't break anything there: looks good to me! I'll update the services task to let them know they should be unblocked. Thanks for the update and all the help! [13:44:42] thcipriani: soudns good, thanks! [13:45:14] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (blocked): Update Debian Package for Scap3 to 3.8.4-1 - https://phabricator.wikimedia.org/T199283 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is completed [13:52:03] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [13:52:48] vgutierrez: ^ [13:54:03] PROBLEM - Host conf1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:54:28] XioNoX: <3 [13:54:33] thx! [13:56:17] conf1005 seems up [13:56:36] was there maintenance on the mgmt port or on the network? [13:57:24] (03PS2) 10Ema: network::constants: define all caches, not only cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) [13:58:26] XioNoX: can you see interface issues? 
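On the deploy-promote error above ("fatal: Not a git repository"): as thcipriani notes, the script starts by git-checking that it is itself the latest version, so it has to be run from inside the releng release checkout rather than from the home directory. A rough sketch of the working invocation; only the release/bin/deploy-promote path appears in the log, the clone URL and the group argument are assumptions:

    # Run from within the 'release' checkout so the script's self-update check
    # finds a git repository.
    git clone https://gerrit.wikimedia.org/r/mediawiki/tools/release
    cd release
    ./bin/deploy-promote group1    # promote group1 wikis to the new wmf branch (argument assumed)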
[13:58:41] or a lots link or something [13:58:45] *lost [13:58:59] not that I'm aware of [13:59:20] We don't track mgmt ports so I don't know where it's plugged [13:59:23] ah [13:59:38] jynus: cmjohnson1 is your best bet here [14:00:26] jynus mgmt is a dumb switch...no maintenance on those [14:01:02] it's connected and green led on my end [14:01:17] mmm [14:01:34] it is not high priority as mgmt is not normally used [14:01:34] unable to do remote IPMI on my end :) [14:01:36] (03CR) 10Mark Bergsma: Correct incoming connection interaction with BGP FSM (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/436299 (owner: 10Mark Bergsma) [14:01:52] and seems it kinda times out after a while, doesn''t error immediately [14:02:16] for S&G's i changed the cable [14:02:42] no love [14:03:20] (03CR) 10Vgutierrez: [C: 031] Correct incoming connection interaction with BGP FSM [debs/pybal] - 10https://gerrit.wikimedia.org/r/436299 (owner: 10Mark Bergsma) [14:03:26] let's go over the planned maintenance, we can go back to this later [14:03:47] I can have a look [14:03:55] locally ipmi works, checking a bunch of things [14:04:39] on conf1005 is currently running zookeeper (and etcd but not used atm), not super urgent but in case of fire the mgmt interface is really needed :) [14:04:40] yeah it should work because it is most likely network (not sure which layer, as cris says link should be up) [14:04:55] elukey: I am not saying conf1005 is not important [14:05:02] we can power it down [14:05:06] and see if it comes back up [14:05:19] just that the mgmt doesn't need high avilability [14:06:25] (03CR) 10Volans: "Given that wasat is already passive and noone is using it we can simply disable puppet and already have it removed from all the places in " [puppet] - 10https://gerrit.wikimedia.org/r/445149 (owner: 10Muehlenhoff) [14:06:35] jynus ..no until you need it;-) [14:06:44] he he [14:06:49] jynus: and I haven't said anything about what you think about it, it was only a reminder in case people didn't know what's running on it :) [14:07:53] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) From lldpcli everything looks good: ```name=lldpcli show neighbors root@lvs1015:~# lldpcli show neighbors | egrep "Interface|PortDescr" Interface: enp4s0f0,... [14:08:00] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/11778/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [14:10:01] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) [14:11:48] (03PS4) 10Giuseppe Lavagetto: mediawiki: move private wikis to a separate virtual host [puppet] - 10https://gerrit.wikimedia.org/r/444186 [14:15:15] !log reset iLO on hpiLO on conf1005 to see if it fixes the issues with the mgmt interface [14:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] RECOVERY - Host conf1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [14:15:55] elukey, jynus, cmjohnson1: seems back to work after a reset :) [14:15:58] 10Operations, 10Cloud-Services, 10Cloud-VPS: cronspam from labtestservices2001 /etc/dns-floating-ip-updater.py > /dev/null - https://phabricator.wikimedia.org/T152439 (10Andrew) 05Open>03Resolved a:03Andrew Pretty sure this is resolved. [14:16:05] awesome! 
Thx [14:16:24] volans: <3 [14:16:51] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@2e56855]: Move static blacklisting to change-prop T198386 [14:16:53] volans: what exactly do you mean with a reset- from comand line, reset managment? [14:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:55] T198386: Move static rerender blacklist from RESTBase to ChangeProp - https://phabricator.wikimedia.org/T198386 [14:17:30] jynus: I tried other usual culprits but were all ok (config wise), then I saw that ssh-ing to the mgmt interface was working [14:17:53] oh [14:17:58] so I logged there and reset the hpiLO with: 1) cd /map1 2) reset [14:18:02] so ssh was working but it was not responding to ping? [14:18:13] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@2e56855]: Move static blacklisting to change-prop T198386 (duration: 01m 22s) [14:18:15] ping was failing and ssh icinga was timing out, but I was able to login [14:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:27] thanks, volans [14:18:38] yw, I'll try to put the usual culprits in a wiki page [14:18:41] I am only asking to do the same if it hapens again anywhere [14:18:43] I'm starting to forget them [14:22:30] (03PS5) 10Giuseppe Lavagetto: mediawiki: move private wikis to a separate virtual host [puppet] - 10https://gerrit.wikimedia.org/r/444186 [14:22:58] (03PS2) 10Muehlenhoff: Reimage wasat with stretch and rename to mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/445149 (https://phabricator.wikimedia.org/T192092) [14:23:07] (03CR) 10Muehlenhoff: "Makes sense, I updated the patch" [puppet] - 10https://gerrit.wikimedia.org/r/445149 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [14:25:11] (03PS6) 10Vgutierrez: get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 [14:25:32] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: move private wikis to a separate virtual host [puppet] - 10https://gerrit.wikimedia.org/r/444186 (owner: 10Giuseppe Lavagetto) [14:26:19] (03CR) 10Vgutierrez: get rid of openssl CLI usage (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [14:27:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) @elukey I still see flows to gerrit (2620:0:861:3:208:80:154:85) https from: 2620:0:861:108:10:64:53:26 2620:0:861:108:10:64:... 
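For reference, the reset volans describes above is done over SSH to the server's out-of-band iLO interface, not to the host itself; a minimal sketch, assuming the usual .mgmt FQDN convention and the root login (only "cd /map1" and "reset" come from the log):

    # SSH to the management (iLO) processor, not the operating system.
    ssh root@conf1005.mgmt.eqiad.wmnet
    # at the iLO SMASH CLP prompt:
    cd /map1
    reset    # restarts only the management controller; the host keeps running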
[14:37:06] (03PS7) 10Vgutierrez: get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 [14:37:13] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1086 after reimage with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445144 (owner: 10Jcrespo) [14:38:56] (03PS1) 10Giuseppe Lavagetto: mediawiki_test: fix boardgovcom vhost url [puppet] - 10https://gerrit.wikimedia.org/r/445178 [14:38:58] (03Merged) 10jenkins-bot: mariadb: Repool db1086 after reimage with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445144 (owner: 10Jcrespo) [14:39:19] (03CR) 10jenkins-bot: mariadb: Repool db1086 after reimage with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445144 (owner: 10Jcrespo) [14:39:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki_test: fix boardgovcom vhost url [puppet] - 10https://gerrit.wikimedia.org/r/445178 (owner: 10Giuseppe Lavagetto) [14:41:02] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [14:41:22] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [14:43:15] whatttt [14:43:24] mobrovac: --^ [14:45:16] 10Operations, 10Pybal, 10Traffic: Unhandled error stopping pybal: 'RunCommandMonitoringProtocol' object has no attribute 'checkCall' - https://phabricator.wikimedia.org/T157786 (10mark) 05Open>03Resolved a:03mark This has been addressed in acdd0ebf74e5dd9e06c3216b9a93063ab8e91574 [14:47:25] so I am seeing, on kafka1001 [14:47:25] KafkaTimeoutError: KafkaTimeoutError: Batch for TopicPartition(topic='eqiad.mediawiki.job.wikibase-addUsagesForPage' [14:47:38] PROBLEM - Kafka Broker Server on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [14:47:53] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1 [14:48:03] womp womp [14:48:16] Jul 11 14:48:01 kafka1001 kafka-server-start[11221]: java.io.IOException: Too many open files [14:48:22] PROBLEM - Check systemd state on kafka1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:48:22] ouch [14:48:31] lovely [14:48:36] <_joe_> uhm what's up? [14:48:42] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [14:48:51] <_joe_> elukey: is there anything that could've caused this? [14:48:59] <_joe_> 1001 is kafka-main, right? [14:49:05] yes exactly [14:49:15] <_joe_> uh that's bad [14:49:23] I just checked on kafka, still need to figure out what happened [14:49:51] looks like to me so far kafka and eventbus are affected ? 
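A quick way to confirm the "Too many open files" diagnosis above is to compare the broker's open-descriptor count with the limit its process runs under; a minimal sketch (the pgrep pattern mirrors the process arguments shown in the icinga checks, the rest is standard /proc inspection):

    # Count descriptors held by the Kafka broker and show the limit it hit.
    pid=$(pgrep -f 'Kafka /etc/kafka/server.properties')
    ls /proc/"$pid"/fd | wc -l                  # descriptors currently open
    grep 'Max open files' /proc/"$pid"/limits   # soft/hard limit for the process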
[14:49:59] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-3h&to=now [14:50:01] <_joe_> LimitNOFILE=65536 [14:50:03] LimitNOFILE= until a root cause is needed [14:50:12] <_joe_> yeah I agree with jynus [14:50:13] that is quite big already [14:50:26] looking [14:50:30] although we have 200K in use for mysql and no issues [14:50:33] <_joe_> godog: changeprop and the jobqueue will be affected as well [14:51:12] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:51:17] sure _joe_ if you want to go ahead with limitnofile please do [14:51:18] PROBLEM - Kafka Broker Server on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [14:51:34] <_joe_> elukey: try restarting kafka maybe? [14:52:14] sure [14:52:19] !log restart kafka on kafka1001 [14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:22] RECOVERY - Check systemd state on kafka1001 is OK: OK - running: The system is fully operational [14:52:24] <_joe_> elukey: if only I found where that service is defined [14:52:31] <_joe_> elukey: should I try 1002 as well? [14:52:37] RECOVERY - Kafka Broker Server on kafka1001 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [14:52:38] <_joe_> it's dead in the same way there too [14:52:52] RECOVERY - Check systemd state on kafka1002 is OK: OK - running: The system is fully operational [14:53:06] lets not monitor file usage [14:53:08] !log restart kafka on kafka1002 [14:53:09] *now [14:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1003 is CRITICAL: 692 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [14:53:17] in case it comes back quickly [14:53:18] _joe_ done, 1003 looks good [14:53:18] RECOVERY - Kafka Broker Server on kafka1002 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [14:53:38] _joe_ I was checking 1 by 1 [14:54:04] <_joe_> elukey: what's the module that installs kafka? [14:54:15] <_joe_> tell me it's not a submodule or I'll scream [14:54:18] I will check impact and recovery on higher level stats [14:54:19] nono :) [14:54:37] what happened here??? [14:54:51] _joe_ it should be confluent something (don't recall exactly now) [14:55:05] /nick GBirke_WMDE [14:55:05] Pchelolo: so kafka on 100[1,2] stopped working because of too many open files [14:55:14] and eventbus didn't like it [14:55:34] jobqueue event bus seems still dead [14:55:40] of course it didn't 0 events accepted [14:55:58] 0.3 jobs/s [14:56:01] for at least 10 mins [14:56:11] Pchelolo: kafka topics --describe shows a ton of topics like Topic:eqiad.change-prop.retry.change-prop.retry.change-prop.retry.mediawiki.job.crosswikiSuppressUser [14:56:18] is that normal? [14:56:20] maybe it takes some tiem to restart? 
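The LimitNOFILE exchange above (and the "kafka::main: remove limits on open files temporarily" puppet patch that follows) boils down to raising the unit's descriptor ceiling until the leak is understood. Done by hand rather than through puppet, the stopgap would look roughly like this; the kafka unit name matches the restarts logged here, the override values are assumptions:

    # Temporary override; remove it once the fd leak has a root cause.
    sudo systemctl edit kafka      # creates /etc/systemd/system/kafka.service.d/override.conf
    # add to the drop-in:
    #   [Service]
    #   LimitNOFILE=infinity       # or simply a much higher cap than the current 65536
    sudo systemctl restart kafka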
[14:57:12] still no activity shown at https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1 [14:57:20] elukey: no it's not normal [14:57:28] lemme try something [14:57:29] !log rolling restart of eventbus on kafka100[1-3] [14:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:33] also load average on kafka100[123] is high now, though I think that's eventlogging [14:57:51] i.e. might be a consequence, not a cause [14:57:54] <_joe_> why eventlogging goes on kafka main? [14:58:04] <_joe_> eventlogging should be on -analytics [14:58:09] <_joe_> or I'm missing something? [14:58:25] _joe_ so eventbus is an incarnation of eventlogging [14:58:39] different code paths but they are on the same repo [14:58:48] ah, my bad then, that's eventbus [14:58:51] !log shutting down furud to disconnect disk array shelves [14:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:55] (03PS1) 10Giuseppe Lavagetto: kafka::main: remove limits on open files temporarily [puppet] - 10https://gerrit.wikimedia.org/r/445183 [14:59:04] so the units are named with the eventlogging prefix historically [14:59:05] <_joe_> elukey: ^^ [14:59:06] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/schemas/{schema_uri} (Untitled test) timed out before a response was received: /v1/events (Produce a valid test event) timed out before a response was received [14:59:25] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 206 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [14:59:47] <_joe_> godog: I think the high load is due to re-replication of lost partitions [14:59:57] I think processing is going up now? [15:00:13] at least cdnpurge [15:00:19] <_joe_> now think if kafka-main mediated anything more urgent than the jobqueue [15:00:24] _joe_: I'm not sure, from top it was all python processes before the rolling restart [15:00:38] <_joe_> godog: uhm ok that was definitely eventbus [15:00:56] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [15:01:00] <_joe_> elukey: https://puppet-compiler.wmflabs.org/compiler02/11779/kafka1001.eqiad.wmnet/ [15:01:10] +2 [15:01:25] <_joe_> yeah, the issue is, that might restart kafka on the nodes [15:01:50] <_joe_> uhm, actually no [15:01:52] Just to decrease the load a bit I'll stop changeprop in eqiad [15:01:53] we'll do another roll restart, I'll make sure that the cluster is stable before doing it [15:01:54] <_joe_> you use systemd::service [15:02:03] it shouldn't auto restart [15:02:08] <_joe_> Pchelolo: changeprop, not cp-jobqueue [15:02:10] <_joe_> right?
[15:02:14] _joe_: right [15:02:19] <_joe_> cool, makes sense [15:02:57] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11779/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/445183 (owner: 10Giuseppe Lavagetto) [15:03:08] so kafka1003 just died for java.lang.OutOfMemoryError: Java heap space [15:03:14] !log restart kafka on kafka1003 [15:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:26] PROBLEM - Varnishkafka Statsv Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [15:03:27] load is still high [15:04:02] <_joe_> we're still nto producing messages if the kafka dashboard is correct [15:04:07] <_joe_> *not [15:04:15] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.92 ms [15:04:15] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventbus_8085: Servers kafka1002.eqiad.wmnet are marked down but pooled [15:04:19] on the other kafkas there are OOMs [15:04:31] <_joe_> elukey: what is happening now? [15:04:33] <_joe_> any idea? [15:04:35] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventbus_8085: Servers kafka1001.eqiad.wmnet are marked down but pooled [15:05:03] the jobque seems processing at 1/20 of the normal speed [15:05:37] _joe_ the only super weird thing that I saw running kafka topics --describe was a ton of topics with change-prop.retry repeated a lot of times [15:05:37] <_joe_> elukey: IO see things like [15:05:38] <_joe_> Jul 11 15:05:17 kafka1002 kafka-server-start[8393]: Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "zk-session-expiry-handler1-SendThread(conf1004.eqiad.wmnet:2181)" [15:05:44] but I am not sure if it is garbage or not [15:05:53] and now almost nothing [15:06:07] <_joe_> Jul 11 15:05:58 kafka1002 kafka-server-start[8393]: [2018-07-11 15:05:58,170] INFO [ReplicaFetcher replicaId=1002, leaderId=1001, fetcherId=3] Retrying leaderEpoch request for partition eqiad.change-prop.retry.change-prop.retry.change-prop.retry.mediawiki.job.HTMLCacheUpdate-0 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread) [15:06:18] <_joe_> wtf does this mean? [15:06:36] no idea [15:06:43] <_joe_> I think we should also page andrew? [15:06:58] <_joe_> elukey: tbh I'm no kafka expert at all [15:07:05] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kafka1001.eqiad.wmnet]) [15:07:06] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:07:12] I can see stuff like eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.cpjobqueue.retry.mediawiki.job.flaggedrevs_CacheUpdate-0 [15:07:16] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:07:17] that is clearly wrong [15:07:25] but I am not sure if it was already there or not [15:07:25] <_joe_> Pchelolo: let's stop cp-jobqueue as well please? 
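The OutOfMemoryError and UNKNOWN_SERVER_ERROR lines quoted above come from the broker's journal, so a rough sense of how widespread they are is one grep away. A sketch, assuming the broker runs under a systemd unit named kafka, as the restarts earlier in this log suggest:

```bash
# How many OOMs and replica-fetch errors since the trouble started (~14:40 UTC)
journalctl -u kafka --since '14:30' | grep -c 'java.lang.OutOfMemoryError'
journalctl -u kafka --since '14:30' | grep -c 'UNKNOWN_SERVER_ERROR'
```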
[15:07:33] _joe_: kk doing [15:07:35] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kafka1002.eqiad.wmnet]) [15:07:44] <_joe_> we need to get kafka back to a state of sanity [15:07:46] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:07:48] <_joe_> first [15:08:16] yeah I agree [15:08:29] !log stop cpjobqueue in eqiad [15:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:16] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:09:42] database open connections seem high, which is strange given many operations are not currently being done [15:09:46] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [15:09:51] I am retrying to put kafka back restarting the daemons failed for OOM again [15:10:16] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1003 is CRITICAL: 2210 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [15:10:35] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [15:10:35] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [15:10:35] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [15:10:45] so 1003 is up [15:11:08] !log restart again kafka on kafka100[1,2] - failed for OOM [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:43] <_joe_> elukey: 1002 seemed to be recovering though [15:11:56] it showed a ton of OOM in the logs [15:12:05] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [15:12:36] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [15:12:41] so one thing to check is what happened when kafka started misbehaving [15:12:59] <_joe_> elukey: let's first try to get it in a state of sanity [15:13:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:13:28] <_joe_> Jul 11 15:13:15 kafka1002 kafka-server-start[11736]: Caused by: java.io.FileNotFoundException: /srv/kafka/data/replication-offset-checkpoint (Too many open files) [15:13:40] <_joe_> elukey: let's do things in order [15:13:44] <_joe_> I'll start with 1002 [15:13:50] <_joe_> I will run puppet first [15:13:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:13:59] elukey, _joe_: what's the current status, what's the impact and where is help needed? 
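The "Under Replicated Partitions" alerts firing here can also be cross-checked from the CLI rather than waiting for the dashboard. A sketch using the same kafka wrapper, assuming the upstream --under-replicated-partitions flag is passed through to kafka-topics:

```bash
# List partitions whose in-sync replica set is smaller than the replication factor
kafka topics --describe --under-replicated-partitions | head

# A plain count is enough to compare against the icinga threshold (>= 10)
kafka topics --describe --under-replicated-partitions | wc -l
```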
[15:14:06] 10Operations, 10SRE-Access-Requests: +2 for Addshore on operations/puppet - https://phabricator.wikimedia.org/T199325 (10Legoktm) +2 on ops/puppet is associated with being a global root in Wikimedia (https://tools.wmflabs.org/ldap/group/ops). Without a pressing/convincing need, I doubt it will be granted. [15:14:06] <_joe_> then start kafka [15:14:19] <_joe_> paravoid: the kafka-main cluster has 2 nodes out of 3 that are failing [15:14:34] <_joe_> help is needed figuring out whatever is going on and how to make the cluster recover [15:14:36] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy [15:14:43] _joe_ wait a sec to see if the cluster recovers [15:14:46] <_joe_> impact is anything going through kafka is blocked [15:14:57] <_joe_> elukey: on 1002 kafka is dead [15:15:12] <_joe_> oh just restarted by puppet [15:15:16] * gehel is late to the party, but we enabled kafka poller on wdqs10(09|10), I can kill them to see if it has impact [15:15:33] gehel: when did you enabled it? [15:15:34] <_joe_> gehel: yes kill them now pleas [15:15:54] <_joe_> elukey: let's first concentrate on recovering the current outage, then on root causes [15:15:55] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy [15:15:56] someone stopped ircecho? [15:16:02] <_joe_> no [15:16:03] an no, sorry, changed user [15:16:15] <_joe_> so, status [15:16:27] <_joe_> kafka after runninfg puppet worked for a few seconds on 1002 [15:16:35] <_joe_> but now fails with java.io.IOException: Too many open files [15:16:39] !log killing wdqs-updater on wdqs10(09|10) to diminish load on kafka [15:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:47] <_joe_> is this some setting of the damn jvm? [15:16:55] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy [15:16:56] <_joe_> I changed the systemd unit [15:17:45] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [15:17:51] kafka poller for wdqs was enabled at 1:28pm CEST [15:18:06] 11:28am UTC [15:18:24] root@kafka1002:~# ls /proc/`pidof java`/fd |wc -l [15:18:24] 134 [15:18:25] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:18:27] what am I missing? 
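The surprisingly low count of 134 above is easy to get when `pidof java` picks the wrong JVM (MirrorMaker also runs on these hosts) or a stale PID; paravoid points at the current PID just below. A less ambiguous check, assuming the broker's unit is named kafka:

```bash
# Resolve the broker's main PID from systemd instead of guessing with pidof
pid=$(systemctl show -p MainPID kafka | cut -d= -f2)

# Effective open-file limit for that process...
grep 'Max open files' "/proc/${pid}/limits"

# ...and how many descriptors it actually holds right now
ls "/proc/${pid}/fd" | wc -l
```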
[15:18:35] looks like trouble started around 14:40 https://grafana.wikimedia.org/dashboard/db/kafka?orgId=1&from=1531316577740&to=1531321828891&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All [15:18:41] <_joe_> !log restarting kafka on kafka1002 [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:46] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:19:02] so all the retry-retry topic files are touched about 14:38 [15:19:08] gehel: those times doesn't match when the outage started (from what I can see) [15:19:16] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [15:19:21] ah sorry, that was mirrormaker I think [15:19:27] <_joe_> paravoid: yes [15:19:38] yeah, me neither, unless there is a build up in some way. Anyway, let's keep wdqs out of the equation atm [15:19:53] <_joe_> paravoid: 15041 is the current pid [15:19:59] yeah got it [15:20:07] _joe_ from lsof on kafka1001 I can see a ton of fds for the topics partition logs on disk of course [15:20:10] that happened befroe when job queue and changep-prop started subscribing to each other's topics [15:20:16] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy [15:20:31] now I am wondering if all those topics with weird repetitions are causing this mess [15:20:33] Pchelolo: I see from the dashboard above that just before 14:40 partition leaders spiked [15:20:35] fds are /srv/kafka/data open files [15:20:46] 4k and counting [15:20:53] and basically tripled [15:20:58] <_joe_> yes [15:21:09] <_joe_> so something started adding topics to kafka [15:21:14] <_joe_> and that opens more files [15:21:21] yeah it would make sense [15:21:29] and now kafka tries to recover them, ending up in the same state [15:21:30] it doesn't look like it's one per topic [15:21:33] <_joe_> so let's first raise the limit on the number of files [15:21:36] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 206 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [15:21:44] what was the limit that was reached before? [15:21:46] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [15:21:48] _joe_ yep I agree, have you got any luck with systemd ? [15:21:51] <_joe_> paravoid: 65k [15:21:51] was it 1024 before or something else? [15:21:52] paravoid: 65K [15:21:55] lol [15:21:58] great :P [15:22:09] <_joe_> elukey: I did a mistake, LimitNOFILE= means reset it [15:22:12] <_joe_> not no limit [15:22:15] <_joe_> I guess [15:22:17] one thing that we could do is isolate the wrong topics and nuke them [15:22:18] (03CR) 10C. Scott Ananian: [C: 04-2] "Self C-2 until I double-check Anomie's concern." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. 
Scott Ananian) [15:22:43] elukey: all the wrong topics have `change-prop.retry.change-prop.retry` in them [15:22:46] yeah [15:23:04] =infinity should make them unlimited [15:23:12] <_joe_> moritzm: yes [15:23:30] (03PS1) 10Mforns: Correct white-list path for EventLogging sanitization in Hive [puppet] - 10https://gerrit.wikimedia.org/r/445187 (https://phabricator.wikimedia.org/T193176) [15:23:49] 2 eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.cpjobqueue.retry.mediawiki.job.TranslationsUpdateJob-0 [15:23:55] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [15:23:56] (03PS1) 10Giuseppe Lavagetto: kafka-main: actually allow aribtrary number of open files [puppet] - 10https://gerrit.wikimedia.org/r/445188 [15:23:59] change-prop.retry.change-prop.retry.[...] [15:24:15] eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.partiti [15:24:15] <_joe_> paravoid: yes, it looks like changeprop and changeprop-jobqueue started messing with each other [15:24:19] oned.mediawiki.job.refreshLinks-0 [15:24:22] etc. [15:24:25] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [15:24:27] that change-prop.retry is repeating infinitely [15:24:28] <_joe_> and started creating topics over topics [15:24:54] it doesn't look like that's normal [15:25:01] at least to my untrained eye :) [15:25:03] <_joe_> paravoid: no it's not [15:25:13] Pchelolo, mobrovac: ^^ [15:25:16] paravoid: yeah exactly, we think that those are the cause of the file limit breached [15:25:18] paravoid: _joe_: that did happen before but we did fix that. No clue why did they start doing that again [15:25:19] <_joe_> I think that's what elukey and Pchelolo were proposing to nuke [15:25:27] yep [15:25:30] exactly [15:25:57] (03CR) 10Giuseppe Lavagetto: [C: 032] kafka-main: actually allow aribtrary number of open files [puppet] - 10https://gerrit.wikimedia.org/r/445188 (owner: 10Giuseppe Lavagetto) [15:26:01] Pchelolo: are you looking into it? [15:26:12] (03CR) 10Daniel Kinzler: [C: 031] Add techadmin to privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [15:26:15] 2086 topics with change-prop.retry.change-prop.retry [15:26:17] I'm trying to find out why they started to get created [15:26:33] <_joe_> so good news is with the new limit kafka1002 seems to stay afloat [15:26:39] !log install dtach on labnet1003 [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:49] _joe_ \o/ [15:27:07] _joe_: it's 10k fds now, so if the limit was 65k before it wouldn't have hit that? [15:27:17] (03CR) 10C. Scott Ananian: "I did indeed confuse them! Although $wgTidyConf is (soft)deprecated, and we'll want to remove it eventually too. 
But not until ParserMig" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445070 (https://phabricator.wikimedia.org/T199293) (owner: 10Tim Starling) [15:27:33] <_joe_> paravoid: yeah let's see in a few minutes if that number keeps growing [15:27:35] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:27:52] paravoid,_joe: the default in puppet was 8k, maybe it was that one the value set instead of 65k? [15:27:53] I mean, if the fix was to increase the limit from 64k to infinite [15:28:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [15:28:25] oh that would make more sense [15:28:40] it's fairly stable at 10.3k now, I don't see how it could go over 64k right now? [15:28:45] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [15:28:51] unless it was wdqs :) [15:29:03] <_joe_> well we killed basically anything [15:29:10] <_joe_> Pchelolo: are you sure cp is killed? [15:29:19] lsof -p on the MainPid of kafka.service gives me close to 18k though? [15:29:19] <_joe_> I see connections from scb1004 [15:29:23] I see jobqueue processing going up [15:29:25] 17915 on kafka1003 [15:29:32] 17726 on kafka1002 [15:29:34] _joe_: both cp and cpjobqueue are killed in codfw [15:30:06] <_joe_> Pchelolo: in eqiad? [15:30:15] <_joe_> cpjobqueue will work in eqiad [15:30:16] "lsof -p $PID | wc -l" I meant [15:30:23] yeye in eqiad [15:30:51] almost to normal levels 300 jobs/s processed [15:31:20] ah, puppet restarted it [15:31:21] _joe_ is the new file limit applied to other kafka hosts or only 1002? [15:31:46] and backlog going down [15:31:51] <_joe_> Pchelolo: puppet restarted it maybe? [15:31:55] <_joe_> elukey: just on 1002 [15:31:58] _joe_: ye.. [15:32:06] <_joe_> it's applied via puppet on 1001 but not restarted the service [15:32:12] I see no new insertions [15:32:15] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [15:32:16] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [15:33:04] which both things seem intended and or desirable [15:34:26] <_joe_> Pchelolo: if i am not mistaken, mediawiki writes jobs to kafka via eventbus, right? [15:34:34] right _joe_ [15:34:59] <_joe_> so we need to restart eventbus if it was stopped, I lost track of it [15:35:17] <_joe_> jynus: which dashboard are you looking at? [15:35:25] _joe_: https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus [15:35:43] processing seems kinda more normal now [15:35:48] but no new insertions [15:36:09] (with asterisks- backlog, etc.) [15:36:25] so from kafka topics --describe I can see in sync replicas recovering [15:36:32] but https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-1h&to=now seems not updated? 
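For reference, the distinction being made here is a systemd one: an empty `LimitNOFILE=` resets the setting to its default rather than lifting it, while `infinity` removes the cap. Done by hand instead of via the puppet patch above, that would look roughly like this hypothetical drop-in:

```bash
# Hypothetical hand-made override; in production this is managed by the
# kafka-main puppet change referenced above.
sudo mkdir -p /etc/systemd/system/kafka.service.d
sudo tee /etc/systemd/system/kafka.service.d/override.conf <<'EOF'
[Service]
# "LimitNOFILE=" on its own would only reset to the default, not lift the limit
LimitNOFILE=infinity
EOF
sudo systemctl daemon-reload
sudo systemctl restart kafka
```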
[15:36:38] I know it says insert rate 54 [15:36:42] godog: if you have time can we check metrics for kafka --^ ? [15:36:48] but the job insertion rate is empty [15:36:57] elukey: sure [15:37:18] let's restart eventlogging-service-eventbus ? it might have got confused with all this [15:37:55] last msg on kafka1001 is Jul 11 15:37:34 kafka1001 eventlogging-service-eventbus[1639]: (MainThread) Client closed connection before sending events finished. [15:38:04] I would expect 1K-3K jobs inserted, at least that used to be the rate some time ago [15:38:11] (per second) [15:38:47] Pchelolo: are you sure the migration of the rerender blacklist worked? Can you see if there are any events for edits that should have been filtered out by the blacklist? User:Cyberbot_I on en is an easy one to check for, very active. [15:38:49] <_joe_> Pchelolo: ok I am restarting it [15:39:06] _joe_ ack I was about to ask [15:39:06] RECOVERY - Varnishkafka Statsv Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [15:39:12] this is good --^ [15:39:28] elukey: I think prometheus is timing out on fetch metrics from kafka with all those topics present, double checking [15:39:32] ori: no it didn't quite, and we didn't completely migrate it, for now we just duplicated it in changeprop [15:39:38] godog: /o\ [15:40:05] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1 [15:40:15] <_joe_> !log restarting eventbus on kafka-main in eqiad [15:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] probably unrelated, but it looks like fr-tech mailbox is overflowing [15:41:21] <_joe_> gehel: yes, unrelated [15:41:38] <_joe_> elukey: any idea what is the status of the kafka cluster? [15:41:52] (03CR) 10Muehlenhoff: Remove pear packages from MW Application Servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434710 (https://phabricator.wikimedia.org/T195364) (owner: 10Reedy) [15:42:36] <_joe_> in the eventbus logs I keep seeing [15:42:37] <_joe_> Jul 11 15:42:07 kafka1002 eventlogging-service-eventbus[18649]: (MainThread) Client closed connection before sending events finished. 
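With the Prometheus side lagging, the quickest way to tell "no activity" from "no metrics" is to tail a topic directly. A sketch with kafkacat; the broker hostname appears in this log, but the plaintext port 9092 and the exact topic name are assumptions about the naming convention, not taken from it:

```bash
# Print the last few messages of one job topic; if nothing has been produced
# recently this exits almost immediately with no output (-e: stop at end).
kafkacat -C -b kafka1001.eqiad.wmnet:9092 \
    -t eqiad.mediawiki.job.refreshLinks -o -5 -e -q
```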
[15:43:05] _joe_ metrics are not available due to all those topics and prometheus timing out, but from kafka topics --describe I see that things should be relatively ok [15:43:07] (03PS2) 10Filippo Giunchedi: phabricator: bump request rate_limits [puppet] - 10https://gerrit.wikimedia.org/r/445145 [15:43:09] (03PS1) 10Filippo Giunchedi: prometheus: bump scrape_timeout for kafka jobs [puppet] - 10https://gerrit.wikimedia.org/r/445191 [15:43:11] I did some commons edits that I think are handled by the jobque (category edits), but I got the expected result [15:43:32] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: bump scrape_timeout for kafka jobs [puppet] - 10https://gerrit.wikimedia.org/r/445191 (owner: 10Filippo Giunchedi) [15:43:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/445149 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [15:43:44] elukey: merging as soon as jenkins says yes [15:43:49] unless anyone has an objection, I'm going to move wdqs10(09|10) to RC poller instead of kafka (revert of https://gerrit.wikimedia.org/r/c/operations/puppet/+/444265). Atm the updater is just masked on those nodes [15:44:21] example [15:44:29] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: fix typo in dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/445192 (https://phabricator.wikimedia.org/T196633) [15:44:31] (nope too many tabs) [15:44:34] godog: ack, thanks! [15:44:48] <_joe_> I keep seeing the errors on eventbus [15:44:49] gehel: makes sense thanks [15:45:27] <_joe_> Pchelolo: can you check cp and cpjobqueue? [15:45:37] <_joe_> jynus: mediawiki errors in logstash? [15:45:45] let me check [15:45:56] _joe_: yup looking at them. I think I'll restart them in eqiad just to be sure [15:46:11] <_joe_> Pchelolo: try with one, see if the logs are better, go with the rest [15:46:18] <_joe_> and log your actions [15:46:28] <_joe_> so kafka1002 and 1003 seem ok [15:46:34] btw it seems like kafka prometheus metrics do not work [15:46:36] I was going to say nothing relevant, but "{exception_id}] {exception_url} JobQueueError from line 828 of /srv/mediawiki/php-1.32.0-wmf.12/includes/jobqueue/JobQueueDB.php: Wikimedia\Rdbms\DBQueryError: A database query error has occurred. Did you forget to run your application's database schem" [15:46:36] <_joe_> 1001 seems to still be doing rebalancings [15:46:42] godog: btw it seems like kafka prometheus metrics do not work [15:46:55] Pchelolo: should be recovering soon, were timing out [15:46:56] there's a lot of events flowing in the topics, but metrics report 0 activity [15:47:04] (03PS1) 10Gehel: Revert "Enable kafka poller on test hosts" [puppet] - 10https://gerrit.wikimedia.org/r/445193 [15:47:05] .12 was deployed a few hours ago, but could be a false positibe [15:47:15] also are the spurious topics already nuked? [15:47:19] _joe_ the topic partition leaders are auto rebalancing when the brokers are healthy, but if consumers are restarted the consumer groups will need to be configured etc.. [15:47:32] godog: not yet, I was waiting, but I have a list ready [15:47:33] yeah, not related, that is labtestweb2001 only [15:47:37] not production [15:47:43] <_joe_> elukey: the consumers being eventbus, right? 
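The loop being described over /home/elukey/topics_to_delete is simple enough to sketch; beyond what is stated here, the only assumptions are the shell plumbing and the exact pause length:

```bash
# Delete the runaway retry topics one at a time, pausing between deletes so the
# brokers are not hit with a burst of controller work while still fragile.
while read -r topic; do
    kafka topics --delete --topic "$topic"
    sleep 5
done < /home/elukey/topics_to_delete
```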
[15:47:58] (03CR) 10Gehel: [C: 032] Revert "Enable kafka poller on test hosts" [puppet] - 10https://gerrit.wikimedia.org/r/445193 (owner: 10Gehel) [15:48:23] _joe_ eventbus is a producer, consumers are like cp or others (not sure if they are still enabled or not, was trying to explain what "rebalancing" might mean in the logs" [15:48:26] PROBLEM - Check systemd state on labnet1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:48:35] _joe_: noting relevant on kibana, but I will keep looking [15:48:36] I will actually wait with restarting cp and cpjq [15:49:14] <_joe_> yeah, things seem in a better state now, right? [15:49:18] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10cwdent) [15:49:36] RECOVERY - Check systemd state on labnet1003 is OK: OK - running: The system is fully operational [15:50:05] also not sure if edits could be affected by it, but it seem a bit lower than usual before and higher than usal now [15:50:14] according to metrics based on grafite cp and cpjq work correctly. metrics based on prometheus don't work [15:50:22] https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-3h&to=now [15:50:30] Pchelolo: if you have time, can you sanity check /home/elukey/topics_to_delete on kafka1001? (not taking any action but want to be ready if we want to delete those) [15:50:43] elukey: doing [15:53:01] elukey: looks good [15:53:14] still looking into why prometheus isn't ingesting kafka stats, but please go ahead and remove topics as it might help too [15:53:34] all right, I think that a simple kafka topics --delete --topic $something should be enought Pchelolo (in loop reading from the file) [15:53:54] elukey: these topics were not the cause of the outage, they were created looong time ago, I've asked ottomata to delete them [15:53:55] with some delay between each delete to avoid overwhelming kafka [15:54:08] apparently deleting topics in kafka doesn't really delete the files.. [15:54:47] Pchelolo: so what caused the outage? Sorry but I was convinced that those topics were the problem [15:54:59] I don't know... [15:55:00] I remember some weird topics with repeated naming but not that many [15:55:31] 10Operations, 10ops-eqiad, 10Cloud-VPS: rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) [15:55:45] <_joe_> godog: any idea why prometheus is broken for kafka? [15:56:38] it times out polling from the agent on kafka100[1-3] IIUC [15:56:48] elukey: no idea, looking into the logs.. [15:57:01] _joe_: initially I thought it was a timeout like elukey mentioned, still looking as a longer timeout didn't do it [15:57:29] <_joe_> godog: how do we fetch the metrics from kafka? [15:57:48] <_joe_> I mean do we have an exporter on the servers or it's exposed by the software itself? [15:58:36] <_joe_> elukey: mirrormaker is dead in codfw [15:58:56] _joe_: via jmx_exporter [15:59:28] (03PS2) 10Marostegui: filtered_tables: Remove ar_text and ar_flags [puppet] - 10https://gerrit.wikimedia.org/r/437432 (https://phabricator.wikimedia.org/T192926) [15:59:44] _joe_ we can restart it, it is not a bit deal for the moment [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1600). [16:00:04] GBirke_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:16] I'll bump the timeout more [16:00:39] <_joe_> elukey: it's maybe a symptom of things not being in good shape? [16:00:51] <_joe_> also those messages about record too large [16:01:38] <_joe_> Pchelolo: all jobs seem ok but refreshlinks [16:01:48] <_joe_> they have practically no processing going on [16:02:11] _joe_ we had a similar error a while ago IIRC with the previous version of mirror maker, but then we blacklisted some high volume topics. The new version didn't show any sign of errors [16:03:18] <_joe_> elukey: the error is on kafka [16:04:15] the one that you posted in the other cahn? [16:04:20] *chan ? [16:04:20] <_joe_> yes [16:04:23] _joe_: makes sense - refreshLinks are recursive, so when we lost all the recursion - nothing to do.. [16:04:29] (03PS3) 10Filippo Giunchedi: phabricator: bump request rate_limits [puppet] - 10https://gerrit.wikimedia.org/r/445145 [16:04:31] (03PS1) 10Filippo Giunchedi: prometheus: timeout for jmx_kafka job at 45s [puppet] - 10https://gerrit.wikimedia.org/r/445198 [16:04:56] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: timeout for jmx_kafka job at 45s [puppet] - 10https://gerrit.wikimedia.org/r/445198 (owner: 10Filippo Giunchedi) [16:05:15] Pchelolo: at the time of the drop in traffic though I can see a lot of [16:05:18] [2018-07-11 14:38:34,400] INFO Created log for partition eqiad.change-prop.retry.change-prop.retry.mediawiki.job.LocalPageMoveJob-0 [16:05:38] plus if you look at partition count they spiked up [16:05:46] ye found that as well. so change-prop did start creating the freaking retry-retry topics [16:06:39] <_joe_> godog: uhm I just tried locally on kafka1002 and indeed I get no response from the jmx exporter [16:06:48] <_joe_> oh I got one now [16:06:53] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=48&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-6h&to=now [16:07:05] Pchelolo: --^ just added [16:07:19] timing matches, 14:38 [16:07:38] so those topics seems to be the cause afaics [16:07:39] no? [16:07:47] elukey: yup, seems so [16:08:07] now I have to understand why did it start creating them [16:08:32] <_joe_> and yes, the reason why jmx takes 18 seconds to respond [16:08:40] _joe_: sorry to ask, but are we now semi-ok, but monitoring has issues, is what you mean? (I am a bit lost) [16:08:41] <_joe_> is that we have a ton of metrics like [16:08:42] <_joe_> kafka_cluster_Partition_LastStableOffsetLag{partition="0",topic="eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.cpjobqueue.retry.mediawiki.job.cirrusSearchJobChecker",} 0.0 [16:08:51] <_joe_> jynus: I think so, yes [16:08:58] _joe_ shall we drop some garbage topics? 
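The timeout theory and the "no response from the jmx exporter" observation can be quantified with a timed local scrape. A sketch; the exporter port shown here (7800) is an assumption, not taken from this log, while the kafka_cluster_Partition metric prefix is the one quoted above:

```bash
# Time a local scrape and count how many per-topic partition metrics the
# exporter has to serialize; thousands of garbage topics inflate both numbers.
time curl -s http://localhost:7800/metrics > /tmp/kafka_metrics
grep -c '^kafka_cluster_Partition' /tmp/kafka_metrics
```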
[16:09:03] <_joe_> Pchelolo, elukey we should really remove those topics, yes [16:09:11] <_joe_> AIUI [16:09:17] <_joe_> but please make a call yourself [16:09:47] agreed let's delete the topics why I'm figuring out what happened [16:09:54] just tried with kafka topics --delete --topic eqiad.change-prop.retry.change-prop.retry.ChangeNotification [16:09:56] <_joe_> also we can see if it happens again [16:10:01] and it worked fine [16:10:11] in sence why did they begin getting created [16:10:25] <_joe_> Pchelolo: yeah that needs to be understood of course [16:10:39] <_joe_> it would be good if the two systems could not interact with each other, too [16:11:35] <_joe_> godog: let's wait that elukey has cleaned up the topics, else just raise the timeout to 20 seconds for scraping [16:12:10] _joe_: it is already at 45s, 25s didn't work so well [16:12:22] <_joe_> uhm [16:12:27] meaning 45s is crazy high but seems to work for now [16:13:12] so, apparently, when jobque has issues, databases have lots of open idle connections- and we are going back to that state again [16:13:27] <_joe_> jynus: what do you mean? [16:13:40] <_joe_> we are now in that state, so you think there is some issue? [16:13:51] there was right now a spike of "jobque badness" [16:14:11] see: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=9&fullscreen&orgId=1&from=1531304042278&to=1531325642278&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [16:14:30] things were bad from 14:40 to 15:35 [16:14:44] and we seem to get bad again [16:14:55] <_joe_> elukey: ^^ [16:15:00] <_joe_> whatever you're doing [16:15:17] (note dbs have nothing to do, I am just looking at an independent factor) [16:15:19] I am not doing anything right now [16:15:23] <_joe_> ok [16:15:39] <_joe_> so topics are still rebalancing, but from the data I see [16:15:52] <_joe_> I expect everything to be ok-ish in operational terms [16:16:05] from 15:35 to 16:10 things seemed much better [16:16:06] <_joe_> I'd ask everyone to look at the logs ffo systems they onok [16:16:52] (could be just backlog recovering, so in that case we have have to wait [16:17:10] <_joe_> elukey: also, 1003 has never been restarted, should we? [16:17:49] hah, 1003 is also almost always timing out on 45s for metrics [16:17:50] I was checking it now, it shows some weird errors [16:17:53] I'll restart cpjobqueue to see if that helps [16:18:01] _joe_ yes please let's restart it [16:18:16] things are much clear here: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=1&fullscreen&orgId=1&from=1531304283386&to=1531325883386&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=master [16:18:26] writes from the job queue are very low right now [16:18:31] <_joe_> elukey: can you do it? 
I am now on my mobile connection [16:18:38] sure [16:18:44] <_joe_> jynus: that's refreshlinks [16:18:47] <_joe_> I guess [16:18:48] ok, then [16:18:54] if that is accounted for [16:19:06] <_joe_> as rflinks is way down as rate [16:19:08] I just wanted to ping for anything strange I was seeing [16:19:20] <_joe_> not sure it's completely explained [16:19:24] !log restart cpjobqueue [16:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:35] !log restart kafka on kafka1003 [16:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:53] ouch elukey I'll wait for your restart to finish [16:19:55] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:20:00] <_joe_> godog: kafka1003 still times out [16:20:18] restbase-dev is known [16:20:19] <_joe_> it takes 39 seconds to respond, maybe now that it was restarted it could go a bit better [16:20:28] yeah I think it is faster now [16:21:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1003 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [16:21:55] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1 [16:22:08] I am going to deploy a db change before I block someone else on the queue [16:22:26] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:22:26] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:22:26] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:22:37] errrr.... [16:22:49] <_joe_> wat [16:22:51] kafka1003 restart? [16:22:55] <_joe_> I guess [16:23:02] it is OOMing sadly [16:23:07] <_joe_> I dunno, I'm without an internet connection right now [16:23:07] for all those topics [16:23:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 with low load (duration: 00m 58s) [16:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:17] [2018-07-11 16:24:51,434] ERROR [GroupMetadataManager brokerId=1001] Appending metadata message for group test-change-prop-mobile_rerender generation 459 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager) [16:25:56] <_joe_> elukey: is kafka1003 ok now? 
[16:25:58] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10cwdent) @ayounsi the last one I posted was incomplete, I found the problem and 1531326142 should fix it [16:26:03] I've stopped that thing :( [16:26:13] _joe_ from the logs it keeps going, seems better now [16:26:28] <_joe_> eventbus is in a bad state [16:26:35] I would really remove those topics if possible [16:27:10] elukey: ye please do when kafka1003 comes back [16:28:45] Pchelolo: 1003 went in shutdown (i think) because of OOM, we have only 1G heap space in there [16:28:49] that has been fine up to now [16:29:10] ah no -Xmx2G -Xms2G [16:29:34] even if the graph shows 1g max [16:29:58] ah no that was mirror maker (the 2g settings) [16:29:59] kafka is -Xmx1G -Xms1G [16:30:48] <_joe_> elukey: I think you really need to delete those topics, maybe do it slowly [16:31:07] <_joe_> kafka1003 cannot respond to prometheus right now, probably too many topics there [16:31:09] it may be ok with two brokers up [16:31:21] <_joe_> so 1003 is not up? [16:31:35] it went into shutdown after oom [16:31:39] I can retry to start it [16:32:00] <_joe_> is it in shutdown? [16:32:03] <_joe_> it's logging [16:32:30] <_joe_> [2018-07-11 16:32:26,099] INFO Updated PartitionLeaderEpoch. New: {epoch:144, offset:13178429}, Current: {epoch:132, offset:13178427} for Partition: __consumer_offsets-2. Cache now contains 4 entries. (kafka.server.epoch.LeaderEpochFileCache) [16:32:45] that is puppet surely that has restarted it [16:32:49] <_joe_> but it's timing out on connections to the other nodes [16:33:17] <_joe_> java.io.IOException: Connection to 1001 was disconnected before the response was read [16:33:23] <_joe_> things like that [16:33:33] lemme try to delete some topics and see what happens [16:33:43] at this point there are two brokers somehow healthy [16:33:47] I'll proceed slowly [16:33:57] like 1 every 5s or similar [16:34:10] ok for everybody? [16:34:14] yes [16:34:30] <_joe_> I see eventbus is in complete failure now [16:35:15] <_joe_> elukey: you're deleting partitions I see [16:35:50] yeah [16:36:14] quite weirdly it's not reflected in /srv/kafka/data/ directories number [16:36:17] !log start topic clean procedure on kafka1001 (tmux root session) [16:36:18] <_joe_> I see a ton of errors in the kafka logs [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:35] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 561 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [16:36:41] _joe_: any specific broker? [16:37:18] <_joe_> on all brokers [16:37:24] <_joe_> java.io.IOException: Connection to 1001 was disconnected before the response was read [16:37:27] <_joe_> on 1002 [16:37:46] <_joe_> [ReplicaFetcher replicaId=1001, leaderId=1002, fetcherId=2] Error in response for fetch request on 1001 [16:38:25] <_joe_> so, Pchelolo can I stop for real changeprop and cpjobqueue? 
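The brief confusion just above about whether the broker has 1G or 2G of heap comes from there being more than one JVM on these hosts (the broker plus MirrorMaker), so matching on the broker's main class avoids reading the wrong process's flags. A sketch, reusing the same process signature the icinga check quotes:

```bash
# The broker is the java process started with the kafka.Kafka main class and
# /etc/kafka/server.properties; MirrorMaker is a separate JVM with its own -Xmx.
pid=$(pgrep -f 'kafka\.Kafka /etc/kafka/server.properties')
ps -o args= -p "$pid" | tr ' ' '\n' | grep -E '^-Xm[sx]'
```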
[16:38:29] I see some Eventbus: "Unable to deliver all events Timeout was reached" starting at 14:20 [16:38:34] on mediawiki logs [16:38:37] _joe_: ye sure [16:38:46] <_joe_> jynus: that's exactly the effect I expected there [16:38:56] both versions of the train [16:39:39] not a lot of them, 6K in the last 2 hours [16:40:26] also lower at the time I mention things seemed much better [16:40:46] <_joe_> !log masking and stopping cpjobqueue, changeprop everywhere [16:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:25] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:42:46] 66 topics deleted up to now [16:42:50] <_joe_> elukey: it looks like the kafka cluster is just not able to work [16:42:55] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:42:55] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:42:56] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:42:57] <_joe_> elukey: out of how many? [16:43:03] 2086 :( [16:43:05] <_joe_> scb is expected, it's me [16:43:15] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:15] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:36] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:46] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:48] <_joe_> it looks better though [16:43:55] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:55] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:43:55] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:11] <_joe_> elukey: so it seems no node can keep up with each other [16:45:17] prometheus timing out for all kafkas now btw [16:45:29] <_joe_> godog: yeah that would be expected in this situation [16:45:37] <_joe_> we shouldn't have restarted 1003 [16:46:01] <_joe_> java.io.IOException: Connection to kafka1002.eqiad.wmnet:9093 (id: 1002 rack: null) failed. [16:46:06] <_joe_> on 1003 [16:46:16] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [16:46:25] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [16:46:56] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:47:25] I hope this is purgatory before the recovery [16:47:42] <_joe_> we are losing events to the jobqueue [16:47:46] <_joe_> this is a serious issue [16:48:09] <_joe_> I would like any suggestion how to get ourselves out of this that is not "let's hope this recovers". [16:48:16] <_joe_> can we point the jobqueue to another cluster? 
[16:48:20] reset the jobque [16:48:26] failover to codfw [16:48:34] _joe_: all the events are being stored to a file btw [16:48:46] so my current proposal is to stop the topics removal, and then try to restart kafka, and see if the cluster is more stable [16:48:56] <_joe_> Pchelolo: so the disk will be full on the kafka nodes soon? [16:49:06] the other option is to increase a bit the heap size, say from 1G to 2G, to allow more room for the bootstrap work [16:49:09] _joe_: they are stored to files by kafka as well [16:49:26] so the rate of disk usage growing is the same [16:49:56] PROBLEM - Varnishkafka Statsv Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [16:50:16] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:50:28] increasing the heap sounds reasonable to me elukey what do you thing _joe_? [16:50:29] <_joe_> elukey: your call [16:50:31] !log stop topics cleaner script [16:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:41] <_joe_> I think it makes sense, maybe [16:50:58] <_joe_> but I'd like to think if failing over to codfw to write to the queue is ok in the meantime [16:51:07] <_joe_> and use kafka there [16:51:30] if mirror maker acted as I think, all the topics have been mirrored [16:51:32] <_joe_> so do whatever you want elukey but I'd like to make a call on that too [16:51:38] addshore did the backport deploy happen or did I miss smth? [16:51:38] (03PS3) 10Andrew Bogott: realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 [16:51:40] (03PS1) 10Andrew Bogott: eqiad1 scheduler: use simple hostname in scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/445202 [16:51:45] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:51:51] PROBLEM - Kafka Broker Server on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [16:52:02] <_joe_> elukey: yes but we need to point mediawiki in eqiad to kafka in codfw [16:52:13] <_joe_> so that events are produced towards that [16:52:16] codfw looks good! [16:52:36] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [16:52:58] ok I am going to try to bring up kafka again on kafka1001 [16:53:05] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:53:06] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:53:12] (03CR) 10Andrew Bogott: [C: 032] eqiad1 scheduler: use simple hostname in scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/445202 (owner: 10Andrew Bogott) [16:53:16] elukey: with a larger heap? [16:53:39] <_joe_> Pchelolo: what do you think? writing to codfw would work as a stopgap? 
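Mechanically, the proposal to raise the broker heap from 1G to 2G is a change to the JVM options the Kafka startup script reads. A hedged illustration only: the real setting lives in puppet, and both the KAFKA_HEAP_OPTS variable and the /etc/default/kafka path are assumptions about the packaging, not taken from this log:

```bash
# Hypothetical illustration: give the broker 2G of heap so it can replay state
# for the thousands of extra topic-partitions without OOMing during startup.
echo 'KAFKA_HEAP_OPTS="-Xms2G -Xmx2G"' | sudo tee -a /etc/default/kafka
sudo systemctl restart kafka
```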
[16:53:52] funnilly, I am seeing better state right now [16:54:05] RECOVERY - Check systemd state on kafka1001 is OK: OK - running: The system is fully operational [16:54:07] _joe_: if everything that we designed works correctly - it should work. It will not make it worse anyway [16:54:11] RECOVERY - Kafka Broker Server on kafka1001 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [16:54:34] <_joe_> Ok, would you prepare the mediawiki-config change? [16:54:40] Pchelolo: for the moment no [16:54:48] _joe_: kk [16:55:35] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventbus_8085: Servers kafka1003.eqiad.wmnet are marked down but pooled [16:55:45] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventbus_8085: Servers kafka1003.eqiad.wmnet are marked down but pooled [16:55:59] <_joe_> sigh [16:56:25] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kafka1003.eqiad.wmnet]) [16:56:56] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kafka1003.eqiad.wmnet]) [16:56:58] _joe_: wouldn't it be easier to change dns for http://eventbus.discovery.wmnet? [16:57:16] <_joe_> Pchelolo: oh we're using that? [16:57:28] <_joe_> ok, how can we check if the messages flow through codfw? [16:57:37] _joe_: yes. mediwiki uses http://eventbus.discovery.wmnet:8085 [16:57:57] <_joe_> does eventbus log things anywhere? [16:58:05] kafkacat or https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=28&fullscreen&orgId=1 [16:58:25] _joe_: ye, locally on the node and to logstash under `EventBus` [16:58:55] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:59:45] (03CR) 10Imarlier: [C: 031] webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [16:59:51] akosiaris: Did we restart apertium-apy? [17:00:07] <_joe_> Pchelolo: ok I'll depool eqiad [17:00:15] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [17:00:19] <_joe_> everyone is on board with that? [17:00:27] +1 [17:00:32] _joe_: let's try, we won't make things worse anyway [17:00:51] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventbus,name=eqiad [17:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:16] <_joe_> it will take a few minutes for this to propagate [17:02:10] GBirke_WMDE: did you schedule it for swat? 
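The depool logged above is a conftool/discovery operation: with eqiad marked depooled, eventbus.discovery.wmnet resolves to the codfw side only, which is why MediaWiki starts producing there without a config deploy. Roughly, and with the caveat that the exact confctl invocation is reconstructed from the log line rather than copied from it:

```bash
# Depool the eqiad side of the eventbus discovery record
sudo confctl select 'dnsdisc=eventbus,name=eqiad' set/pooled=false

# Watch the discovery record flip once DNS caches expire, hence
# "it will take a few minutes for this to propagate"
dig +short eventbus.discovery.wmnet
```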
[17:02:16] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventbus_8085: Servers kafka1001.eqiad.wmnet, kafka1002.eqiad.wmnet are marked down but pooled [17:02:17] addshore yes [17:02:31] jouncebot: now [17:02:32] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [17:02:37] jouncebot: next [17:02:37] In 1 hour(s) and 57 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1900) [17:02:41] PROBLEM - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:02:56] <_joe_> heh we need help, people [17:03:03] <_joe_> I can control so many things at once [17:03:11] <_joe_> can someone check what's up with eventbus in eqiad? [17:03:13] note sure what eventbus.svc.eqiad.wmnet means now [17:03:17] (03PS2) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: fix typo in dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/445192 (https://phabricator.wikimedia.org/T196633) [17:03:22] <_joe_> i guess kafka is down so eventbus is down [17:03:27] _joe_: jobs are coming to codfw now [17:03:51] _joe_ kafka shouldn't be down, but I am going to check eventbus [17:03:51] <_joe_> elukey: kafka oom on 1001 [17:03:59] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: openstack: eqiad1: fix typo in dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/445192 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [17:04:12] <_joe_> Pchelolo: ok so, I'll try starting cpjobqueue on one node in codfw [17:04:12] ok it needs a heap bump [17:04:13] (03CR) 10Imarlier: [C: 031] webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [17:04:14] I am trying it [17:04:24] GBirke_WMDE: that swat window was 3 hours ago! Were you here for it? [17:05:06] <_joe_> !log restarting cpjobqueue on scb2001 [17:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:26] (03CR) 10Imarlier: [C: 031] webperf: Rename role::xenon to profile::webperf::xenon [puppet] - 10https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [17:05:56] addshore Damn timezones, I was only here 2 hours ago. [17:06:16] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=1&fullscreen&orgId=1 [17:06:18] this seems happy: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=9&fullscreen&orgId=1 [17:06:19] <_joe_> jynus: are the Eventbus exceptions in mediawiki going down? [17:06:19] (03CR) 10Imarlier: [C: 031] mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - 10https://gerrit.wikimedia.org/r/443762 (owner: 10Krinkle) [17:06:30] but we will see if trafic come back [17:06:31] looking [17:06:43] !log restarted kafka on kafka1001 with Xmx 2G and Xms 2F [17:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:48] yeah 2F [17:06:49] ufff [17:06:50] _joe_: last error 17:00 [17:07:22] <_joe_> Pchelolo: I think it's working well on scb2001, can you check? [17:07:25] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (10ayounsi) Pushed. 
[17:07:35] I am wondering how to diferenciate between no activity and good actibity now :-/ [17:07:45] yup _joe_ seems so, looking more closely [17:08:01] oh, mysql traffic seems going up, which is a good synthom [17:08:09] back to pre-issue levels [17:08:40] <_joe_> ok, this is what we should've done immediately [17:08:41] <_joe_> :P [17:08:48] <_joe_> move eventbus to codfw [17:08:51] (and yes, I know that has nothing to do, but as many of you know, I use it as an independent measure of normalness [17:08:59] honestly, I didn't know it was so easy [17:09:10] so kafka1001 with 2G seems much better [17:09:28] <_joe_> Pchelolo: ok to start cpjobqueue on the other nodes in codfw? I'd say yes [17:09:44] is there no downside, like penalty in latency or something? [17:09:53] _joe_: it can survive on a single node and starting on more nodes will make it rebalance [17:10:01] !log restart kafka on kafka1002 with 2G heap settings [17:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:09] <_joe_> Pchelolo: so yes or no? :P [17:10:32] <_joe_> jynus: well it takes longer to mediawiki to send messages [17:10:32] (03PS1) 10Arturo Borrero Gonzalez: hieradata: eqiad1: fix public IP address [puppet] - 10https://gerrit.wikimedia.org/r/445204 (https://phabricator.wikimedia.org/T196633) [17:10:44] _joe_: I'd say let's just wait for kafka eqiad to get ok.? [17:10:57] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=43&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-1h&to=now [17:10:57] <_joe_> and it takes longer to send them back over the wire for the jobrunners to consume [17:11:02] <_joe_> but nothing user-visible [17:11:07] heap looks good [17:11:16] now 1001 is using more than 1GB [17:11:20] I was more like wondering if too slow to keep up, etc. [17:11:21] (03CR) 10Imarlier: [C: 031] "Maybe want to rename the hieradata file at hieradata/role/common/webperf/processors_and_site.yaml?" [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [17:11:49] <_joe_> elukey: but both nodes are down and not responding to requests [17:12:20] <_joe_> Pchelolo: the graphs for codfw look weird though [17:12:21] (03CR) 10Imarlier: "Noted on an earlier commit, the production setting is in hieradata/role/common/webperf/processors_and_site.yaml, which I think may be misn" [puppet] - 10https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [17:12:27] <_joe_> in jobqueue eventbus [17:12:36] !log restart kafka on kafka1003 with 2G heap settings [17:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:21] _joe_: ye.. that't the problem that kafka dashboards are mixed between grafite and prometheus so templating doesn't quite work [17:13:25] _joe_ gimme 5 mins [17:13:28] <_joe_> Pchelolo: specifically https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?panelId=1&fullscreen&orgId=1&from=now-30m&to=now&var-site=codfw&var-type=All [17:13:30] it seems promising [17:13:35] <_joe_> elukey: if you say so [17:13:46] _joe_: yup ,that's a prometheus graph.. [17:13:56] 10Operations, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10MoritzMuehlenhoff) p:05Triage>03Normal [17:13:58] _joe_: thanks for the trust! 
:D [17:14:06] on a mostly grafite dashboard [17:14:10] <_joe_> elukey: I trust you, not what I'm seeing :D [17:14:40] <_joe_> ok so next time something happens to the eqiad cluster [17:14:50] <_joe_> remember first thing to do is to failover eventbus to codfw [17:15:05] yeah definitely [17:15:06] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [17:15:06] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [17:15:08] <_joe_> we will add this to the playbook I'm going to write tomorrow [17:15:24] I was thinking with classic jobqueue mentality [17:15:35] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:15:36] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [17:15:38] classic == redis [17:15:45] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [17:15:50] <_joe_> jynus: me too [17:15:51] RECOVERY - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1561 bytes in 0.002 second response time [17:15:57] <_joe_> I'm kicking myself for that [17:16:04] in which that was not very practical and had clear issues [17:16:26] although lets not celebrate early- was this ever tested before in production? [17:16:29] <_joe_> sigh, all I needed to do was a confctl line [17:16:38] <_joe_> jynus: no this is a first [17:16:46] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal [17:16:49] <_joe_> but it's working by any measures I have [17:16:53] so, let's not bee too confident :-) [17:17:23] almost, OOMs registered for kafka again [17:17:25] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [17:17:43] not even sure how [17:17:48] since it seems not using all that space [17:17:49] <_joe_> elukey: I'm talking about the codfw cluster now [17:18:09] I am looking at the traffic and we are only getting 7/9ths of the regular traffic [17:18:09] <_joe_> elukey: which node went down? [17:18:35] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [17:18:37] <_joe_> jynus: that should change if we start more cpjobqueue instances [17:18:44] ok, so expected [17:18:50] cool [17:19:12] _joe_ sorry I was talking about eqiad, it seems oom registered everywhere [17:19:15] ok, so part of our switchover goal completed ;p [17:19:23] it reaches a moment in which it has too many things to handle [17:19:26] mark: troll :-) [17:19:51] 10Operations, 10SRE-Access-Requests: +2 for Addshore on operations/puppet - https://phabricator.wikimedia.org/T199325 (10MoritzMuehlenhoff) @Jonas +2 permissions on ops/puppet would be equivalent with root access across the complete production cluster (as it would e.g. allow to merge arbitrary SSH keys etc). I... [17:26:12] so how's kafka elukey? 
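One plausible shape of the single "confctl line" _joe_ wishes he had run, assuming eventbus is published as a conftool-managed DNS discovery record (the selector syntax here is from memory, so treat it as a sketch to be checked against the current conftool documentation rather than the exact command):

    # Depool the eqiad discovery record for eventbus and pool codfw instead.
    sudo confctl --object-type discovery select 'dnsdisc=eventbus,name=eqiad' set/pooled=false
    sudo confctl --object-type discovery select 'dnsdisc=eventbus,name=codfw' set/pooled=true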
[17:26:49] I am still not sure, 1003/1002 seems a bit better, 1001 seems down [17:26:57] (03PS1) 10Mark Bergsma: [WiP] Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 [17:26:59] but they are using a lot more heap now [17:27:17] Pchelolo: https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=43&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-1h&to=now [17:27:57] <_joe_> Pchelolo: I see you're fixing grafana <3 [17:27:57] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 (owner: 10Mark Bergsma) [17:28:16] _joe_: I've just duplicated a panel.. [17:29:08] at least we know that switchover works as we expected... [17:29:17] <_joe_> yeah [17:30:55] so kafka on 1001 seems alive but it's constantly doing something.. [17:31:06] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 787 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [17:31:13] rebuilding indexes etc [17:31:17] yeah I've restarted it, it went oom for a bit [17:31:28] !log restart kafka on kafka1001 (oom registered) [17:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:00] <_joe_> Pchelolo: should I start a second instance of cpjobqueue in codfw? [17:32:10] (03CR) 10Rush: [C: 04-1] "The regex seems off to me but for clarity here is the eqiad and codfw ranges:" [puppet] - 10https://gerrit.wikimedia.org/r/445045 (owner: 10Andrew Bogott) [17:32:17] (03CR) 1020after4: [C: 031] phabricator: bump request rate_limits [puppet] - 10https://gerrit.wikimedia.org/r/445145 (owner: 10Filippo Giunchedi) [17:32:32] _joe_: It's better to start them all simultaniously, including 2001. then there's way less rebalances happening [17:32:47] <_joe_> so I should restart them all? [17:32:54] <_joe_> ok :) [17:32:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1003 is CRITICAL: 278 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [17:32:56] (03CR) 1020after4: [C: 031] "see also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444810/" [puppet] - 10https://gerrit.wikimedia.org/r/445145 (owner: 10Filippo Giunchedi) [17:32:59] _joe_: ye, that would be the best [17:33:48] <_joe_> ok, doing that [17:36:03] <_joe_> !log restarted cpjobqueue in codfw [17:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:27] addshore: wait, what?!? wasn't the swat deployment between 16 and 17 utc? [17:38:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1003 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [17:38:26] that would have been 6pm CEST [17:38:36] <_joe_> Pchelolo: should we start changeprop as well? [17:39:30] _joe_: hm... I believe we could yes [17:39:36] I am going to fully repool db1086 [17:39:57] <_joe_> Pchelolo: or should we wait for eqiad to be ok first? [17:40:11] _joe_: change-prop is less important then job queue [17:40:19] <_joe_> yeah [17:40:19] so let's wait.. [17:40:29] kafka seems to be getting somewhere.. 
[17:41:09] I want to have full weight in case at some point we start to get a flood of refreshlinks :-D [17:41:19] <_joe_> jynus: yeah [17:41:33] addshore: oh, figured it out. never mind. [17:43:24] (03PS1) 10Jcrespo: mariadb: Repool db1086 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445208 [17:44:45] 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339 (10Papaul) @robh @faidon if this is no longer an issue can we resolve it? it has been open for more then a year now. Thanks. [17:45:13] no mediawiki error of type eventbus since 17:00 [17:45:16] (03PS1) 10Rush: openstack: eqiad1 assign neutron::network_public_ip [puppet] - 10https://gerrit.wikimedia.org/r/445209 (https://phabricator.wikimedia.org/T196633) [17:46:25] PROBLEM - Check systemd state on kafka1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:46:31] PROBLEM - Kafka Broker Server on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [17:46:41] come on... [17:46:41] thanks kafka1002 [17:46:46] (03CR) 10Rush: [C: 032] openstack: eqiad1 assign neutron::network_public_ip [puppet] - 10https://gerrit.wikimedia.org/r/445209 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [17:47:02] [2018-07-11 17:43:52,164] ERROR Shutdown broker because all log dirs in /srv/kafka/data have failed (kafka.log.LogManager) [17:47:04] I guess not expected [17:47:44] maybe 3 nodes at the time are too much [17:48:07] what if we keep 1002 down [17:48:34] is Dzhan still on vacations? [17:50:12] elukey: this is really brutally weird - 2k topics (mostly empty topics) shouldn't be a problem for it at all [17:50:33] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1086 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445208 (owner: 10Jcrespo) [17:50:44] Pchelolo: I honestly didn't expect this mess, there must be something weird going on [17:52:21] (03Merged) 10jenkins-bot: mariadb: Repool db1086 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445208 (owner: 10Jcrespo) [17:52:37] (03CR) 10jenkins-bot: mariadb: Repool db1086 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445208 (owner: 10Jcrespo) [17:52:52] now 1001 is just complaining that it can't connect to 1002 and 1003 is complaining about 1001 [17:56:30] elukey: OOM on 1001 [17:56:31] Hello, any ideas why I'm getting 500 Server Error: Error: timed out for url: https://stream.wikimedia.org/v2/stream/recentchange, which I use for processing recentchanges? Can this be relevant to the Kafka problem? [17:56:54] Urbanecm: hello, yes, this is precisely because of the kafka problem [17:57:18] <_joe_> Pchelolo: oh ok what is serving stream.wikimedia.org? [17:57:26] <_joe_> we should fix that in varnish/traffic [17:57:34] eventstreams service [17:57:35] <_joe_> sorry, dns [17:57:53] <_joe_> and eventstreams connects directly to kafka, yes [17:57:59] forgot about it [17:58:16] is irc independent or also uses kafka? 
[17:58:24] if that also still works [17:58:25] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1003 is CRITICAL: 766 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [17:58:42] let me check [17:58:54] its working [17:59:32] Last I heard it was some sort of script that took udp packets and put them in the channel. But my knowledge is severely out of date on how it works [17:59:38] _joe_: but eventstreams will not get fixed just by switching the domains [17:59:44] I can confirm that, bawolff [18:00:15] ok Pchelolo let's try this way. 1002/3 are down, 1001 is bootstrapping. It should reach a point in which it is stable, then we might attempt to clean all the topics [18:00:21] and finally restart the others [18:00:37] or I can try with a 4G heapsize [18:00:45] PROBLEM - Check systemd state on kafka1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:00:52] PROBLEM - Kafka Broker Server on kafka1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [18:01:26] (03PS1) 10Rush: openstack: eqiad1 and labtestn fix and cleanup neutron settings [puppet] - 10https://gerrit.wikimedia.org/r/445211 (https://phabricator.wikimedia.org/T196633) [18:01:39] <_joe_> Pchelolo: why? [18:01:40] elukey: hm.. removing more topics might create more work for 1002 and 1003, but if it does, then 4g heap.. [18:01:47] <_joe_> Pchelolo: re:eventstreams [18:01:53] _joe_: ask ottomata.. [18:02:05] <_joe_> Pchelolo: no please if you know why, do explain [18:02:06] PROBLEM - statsv process on webperf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [18:02:10] <_joe_> otto is not here :P [18:02:25] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:02:27] _joe_: I'm trying to remember, I think there was a comment about that somewhere [18:02:31] (03PS1) 10BBlack: cache_misc: stop using eventstreams eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/445212 [18:02:33] lemme find it [18:02:39] (03CR) 10Rush: [C: 032] openstack: eqiad1 and labtestn fix and cleanup neutron settings [puppet] - 10https://gerrit.wikimedia.org/r/445211 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [18:02:46] <_joe_> Pchelolo: I would guess it reads events created by eventbus [18:03:06] (03CR) 10BBlack: [C: 032] cache_misc: stop using eventstreams eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/445212 (owner: 10BBlack) [18:03:11] Pchelolo: IIRC the problem with eventstreams is if we failover to codfw then it will start pulling from what's into codfw [18:03:15] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy [18:03:15] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy [18:03:16] loosing all the eqiad ones [18:03:33] <_joe_> elukey: but we are producing the events to codfw now [18:03:37] (03PS2) 10BBlack: cache_misc: stop using eventstreams eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/445212 [18:03:49] <_joe_> bblack: wait a sec, let me check one thing [18:03:51] (03CR) 10BBlack: [V: 032 C: 032] cache_misc: stop using eventstreams eqiad backend [puppet] - 10https://gerrit.wikimedia.org/r/445212 (owner: 10BBlack) [18:04:02] ok [18:04:16] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy [18:04:23] _joe_ yep yep I know, I was answering to Pchelolo, I remember that conversation.. in theory it is not super easy to switch because we can loose data, but in this case it would be fine(ish) [18:04:35] so atm only kafka on kafka1001 is active [18:04:56] <_joe_> bblack: sorry my bad [18:04:59] <_joe_> the issue is here [18:05:01] <_joe_> metadata.broker.list: kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092,kafka1003.eqiad.wmnet:9092 [18:05:09] <_joe_> Pchelolo: we need to change that list I guess [18:05:11] ok, revert my thing? [18:05:18] (it never puppet-merged) [18:05:28] <_joe_> bblack: I would say so, yes [18:05:34] <_joe_> sorry :( [18:05:40] (03PS1) 10BBlack: Revert "cache_misc: stop using eventstreams eqiad backend" [puppet] - 10https://gerrit.wikimedia.org/r/445213 [18:05:41] <_joe_> I know zero about this system [18:05:45] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:05:53] and it needs to pull from kafka codfw yes [18:05:55] (03CR) 10BBlack: [V: 032 C: 032] Revert "cache_misc: stop using eventstreams eqiad backend" [puppet] - 10https://gerrit.wikimedia.org/r/445213 (owner: 10BBlack) [18:05:58] _joe_: still looking for the comment where it's explained why it's like that [18:06:01] np! [18:06:09] you can use T199353 to centalize patches [18:06:10] T199353: kafka eqiad cluster keeps crashing - https://phabricator.wikimedia.org/T199353 [18:06:42] _joe_: here it is https://github.com/wikimedia/puppet/blob/c2594dec3cea0bca0259cb55081277b7e8b404d5/modules/profile/manifests/eventstreams.pp#L7 [18:07:25] <_joe_> Pchelolo: except it's reading from kafka-main [18:07:27] <_joe_> see above [18:07:46] interesting. 
I never said "yes" to the original, but puppet-merge after the revert, I was expecting an empty review of 2x commits (do and undo) [18:07:58] it just fast-forwarded with no review or question [18:08:00] <_joe_> I got that from /etc/eventstreams/config.yaml on scb1001 [18:08:11] <_joe_> bblack: oh? [18:08:15] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 56.03, 32.66, 20.62 [18:08:25] <_joe_> oh no the appservers too, [18:08:25] not a big issue [18:08:40] elukey: so kafka1001 is good [18:08:44] _joe_: that may be the usual issue [18:08:47] just noting, there's no real review if the set of changes to pull in for puppet-merge amount to a no-op [18:09:56] oh, I had it backwards, but still no issue. because the 2x pending commits amount to zero net diff, it didn't actually merge them (which is fine too) [18:10:00] Pchelolo: yes, I am going to try with 1002 in a bit [18:10:04] on the next real change I guess someone will merge all 3 [18:10:13] Pchelolo: I am also turning off all the mirror makers running [18:10:31] <_joe_> Pchelolo: so [18:10:33] <_joe_> hieradata/role/common/scb.yaml:profile::eventstreams::kafka_cluster_name: main-eqiad [18:10:46] _joe_: ok, found the comment I was looking for https://github.com/wikimedia/puppet/blob/d7e651554411aecb4690e9b28c47393a033440f9/hieradata/role/common/scb.yaml#L58-L70 [18:10:59] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 fully (duration: 00m 57s) [18:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:06] the previous one is outdated and needs to be removed [18:11:31] <_joe_> if as I think eventstreams consumes events generated via eventbus [18:11:40] <_joe_> the events are being generated in codfw now [18:11:43] <_joe_> right? [18:11:54] yup. So no events are coming. [18:11:57] <_joe_> so yeah, up to you people [18:12:06] RECOVERY - statsv process on webperf1001 is OK: PROCS OK: 3 processes with command name python, args statsv [18:12:08] <_joe_> I'm going to dinner, sorry [18:12:23] <_joe_> elukey: phone people if you need help [18:12:25] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational [18:12:26] ok _joe_ we will try to get kafka back asap, thank you [18:12:52] Pchelolo: starting kafka on 1002 now [18:13:21] RECOVERY - Kafka Broker Server on kafka1002 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [18:13:35] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [18:13:55] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 19.01, 24.93, 21.08 [18:13:56] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [18:14:05] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
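If the hiera key quoted just above is indeed what eventstreams reads, the "switch eventstreams to codfw" change merged later in this log (r445229) presumably boils down to flipping that value in hieradata/role/common/scb.yaml, roughly:

    profile::eventstreams::kafka_cluster_name: main-codfw

followed by a puppet run and a service restart on the scb hosts so the new broker list lands in /etc/eventstreams/config.yaml.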
[18:14:33] so I stopped mirror makers on kafka100[1-3] [18:14:42] !log stop mirror makers on kafka100[1-3] [18:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:48] !log start kafka on kafka1002 [18:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:07] and now [18:15:08] [2018-07-11 18:14:53,457] ERROR Shutdown broker because all log dirs in /srv/kafka/data have failed (kafka.log.LogManager) [18:15:14] need to figure out what this means [18:15:15] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:15:26] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [18:15:36] RECOVERY - Varnishkafka Statsv Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [18:15:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1002 is CRITICAL: 252 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [18:16:06] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 1071 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [18:16:19] (03PS4) 10Andrew Bogott: realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 [18:17:28] Pchelolo: ah! I might have found something! [18:17:32] PROBLEM - Kafka Broker Server on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [18:17:39] ERROR Error while renaming dir for eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.ch... [18:17:43] <_joe_> and 1001 died again as well [18:17:45] b81aafbb9f089d2f-delete: File name too long [18:17:50] <_joe_> ahahahahh [18:17:50] oh... [18:18:05] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [18:18:05] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [18:18:05] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: /v1/events (Produce a valid test event) timed out before a response was received [18:18:12] (03PS5) 10Andrew Bogott: realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 [18:18:56] Pchelolo: so this might be the reason for the brokers to die [18:18:56] _joe_: elukey and I have found what caused it and I deserve a "I broke wikipedia" t-shirt [18:19:05] ahahhaha [18:19:06] (03CR) 10Andrew Bogott: [C: 032] realm: add regexes for detecting the new Neutron cloud VM IP ranges. [puppet] - 10https://gerrit.wikimedia.org/r/445045 (owner: 10Andrew Bogott) [18:19:19] so what could we possibly do? 
Brutally clean the dirs on /srv/kafka/data [18:19:22] ? [18:19:24] seems the only viable option [18:19:52] I guess it is the only option now. _joe_??? Sorry for interrupting your dinner [18:20:39] <_joe_> I would +1 that, but I suggest you call people around [18:20:51] <_joe_> bblack: ^^ there is a call to make, and it's a bit delicate [18:20:53] a brutal rm -rf *.change-prop.retry.change-prop.retry* [18:21:24] elukey: lemme try doing that locally really quick [18:21:25] <_joe_> elukey: wouldn't that make kafka complain of not finding those files? [18:21:31] yes definitely [18:21:48] <_joe_> elukey, Pchelolo please consider doing something for eventstreams [18:22:05] or we could just delete the super long ones [18:22:27] <_joe_> elukey: I mean how would that solve something? [18:23:43] _joe_ from what I can see on the broker logs, at some point due to the fs limitation and failures above the broker just shuts down [18:23:55] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:59] so until those dirs are there, it will keep getting in that tsate [18:24:01] *state [18:24:15] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [18:24:19] <_joe_> elukey: ok, do what you think is best [18:24:29] <_joe_> I have no informed opinion on kafka, sorry [18:25:04] Pchelolo: can I ask for a --verbose about what you are doing? [18:25:04] <_joe_> elukey: try this, then please address the eventstreams issue [18:25:55] elukey: I've just locally stopped kafka, rm -rf test_dc.*, and started kafka back - it did start.. [18:26:47] the only thing that I am worried about is the topic state in zookeeper [18:26:56] kafka will complain for sure [18:27:56] hello [18:28:17] * mark reads [18:28:19] Pchelolo: if we just spin up one broker, then delete all topics (it will clean up zookeeper) and then restart the brokers? [18:28:22] hello mark [18:28:42] elukey: ye let's do that first, seems more safe [18:28:46] all righ [18:29:05] mark: it's a mess.. [18:29:17] let's recap a bit [18:29:19] so current thinking it's a fs limitation that is breaking kafak? [18:29:26] basically yes [18:29:37] but kafka keeps topics on zookeeper too [18:29:41] mark: ye, some accidentally created topic names are too long for it [18:29:54] what fs is this? [18:29:59] ext4? [18:30:03] so what I want to do is bring up only one broker, delete all topics from the kafka cli (it will hopefully clean up zookeeper) [18:30:06] then restart the brokers [18:30:13] mark: seems a file name len issue [18:30:28] we have topic names that repeats themselves a ton of times [18:30:28] elukey: ok [18:31:06] ok so my plan is [18:31:14] 1) start only kafka on kafka1001 [18:31:21] 2) delete all topics as I was doing before [18:31:31] 3) start the other brokers one by one [18:32:01] RECOVERY - Kafka Broker Server on kafka1001 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [18:32:08] and you're hoping that will also delete those topics on the other brokers? [18:32:17] * mark doesn't know kafka, so no idea [18:32:18] (03PS1) 10Andrew Bogott: nova.conf: set dhcp_domain = [puppet] - 10https://gerrit.wikimedia.org/r/445219 [18:32:31] PROBLEM - Kafka Broker Server on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [18:32:35] that's the hope, definitely less brutal then rm -rf [18:32:42] yes [18:32:50] are those pages unavoidable? 
Could I downtime something? [18:32:51] what would impact be of a total data loss? [18:32:53] mark: in theory yes, kafka should read the topics from zk when boostrapping, possibly avoiding doing fs moves with them (that are causing the exceptions) [18:32:56] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 52.27, 33.34, 25.53 [18:33:08] (03CR) 10Andrew Bogott: [C: 032] nova.conf: set dhcp_domain = [puppet] - 10https://gerrit.wikimedia.org/r/445219 (owner: 10Andrew Bogott) [18:33:10] mark: no data, the topics are actually completely empty [18:33:15] andrewbogott: yes sorry, you can downtime kafka1001->1003 [18:33:17] sorry :( [18:33:22] let's try that then [18:33:25] thanks, will do [18:33:35] PROBLEM - statsv process on webperf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [18:33:37] hey, what's up ? [18:33:45] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.76, 37.16, 29.38 [18:33:45] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:33:52] akosiaris: eqiad kafka cluster is down [18:33:59] kafka-main ? [18:34:06] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1001 is CRITICAL: 1071 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [18:34:09] and brokers keep shutting down due to a fs path length limitation [18:34:13] So I found out why I get TOO MANY REQUESTS in phabricator [18:34:13] yeah judging by the hostnames [18:34:18] (03PS1) 10Ppchelko: Switch eventstreams to consuming only from kafka-codfw. [puppet] - 10https://gerrit.wikimedia.org/r/445229 (https://phabricator.wikimedia.org/T199353) [18:34:21] it's because the preview uses Ajax [18:34:29] so everytime you type it counts as a request [18:34:31] !log restarted topic nuke script for kafka main [18:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:48] mark: fs path length limitation ? [18:34:55] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy [18:35:16] downtimed for 4 hours [18:35:21] akosiaris: apparently some topics that accidentally got created with a very long path name [18:35:35] ahem... [18:35:39] so nested directories which lead to too long paths... [18:36:00] andrewbogott: thanks! [18:36:38] elukey: so, what's the state right now ? [18:36:40] akosiaris: yeah it seems that for some reason, a ton of topics got created with super long names for change-prop.retry.change-prop.retry.. etc.. [18:37:14] the brokers got a bigger heap size that helped to avoid the OOMs, but then they shutdown when fs exceptions are thrown due to long filenames [18:37:16] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [18:37:22] so topic names are stored in zookeeper too [18:37:30] so via the CLI, I am deleting them [18:37:31] is this 4096 max path length? [18:37:44] thing like eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.cpjobqueue.retry.me ? [18:37:50] created a gerrit to switch eventstreams to codfw https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445229/ should we do it or we have high expectations on current approach? 
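For what it's worth, the limit being tripped here is almost certainly not the 4096-byte PATH_MAX but ext4's 255-byte cap on a single directory-entry name: each partition directory under /srv/kafka/data is named after its topic, and when Kafka marks a partition for deletion it renames the directory with an extra ".<id>-delete" suffix, which is what pushes the longest of these names over the limit and yields the "File name too long" rename error seen earlier. A rough way to eyeball the worst offenders on a broker:

    # Print the longest partition-directory names under the data dir, with their lengths in bytes.
    ls /srv/kafka/data | awk '{ print length($0), $0 }' | sort -rn | head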
[18:38:22] Pchelolo: what's against switching to codfw like everything else? [18:38:25] 10Operations, 10MediaWiki-extensions-Score: Contrabass MIDI instrument is unusable - https://phabricator.wikimedia.org/T199356 (10Ebe123) [18:38:26] i'm missing context I think :) [18:38:39] 10Operations, 10MediaWiki-extensions-Score: Contrabass MIDI instrument is unusable - https://phabricator.wikimedia.org/T199356 (10Ebe123) p:05Triage>03Normal [18:38:57] mark: now - nothing any more, it's been too long [18:39:20] let's do it then [18:39:49] akosiaris: could you help get that merged? [18:40:15] yeah looking at it right now. I am not familiar with eventstreams [18:40:19] elukey: so, seems the gentle approach have failed " Shutdown broker because all log dirs in /srv/kafka/data have failed" ? [18:40:44] We *can* failover to a different [18:40:44] # Kafka cluster, e.g. main-codfw, if we really need to. But note that [18:40:44] # doing so would be disruptive to clients, as their stored offsets [18:40:44] # would no longer make sense after the failover, and cause them to [18:40:44] # miss messages. [18:40:49] great.... [18:40:55] ah [18:41:19] akosiaris: ye, that is relevant when both clusters are operational [18:41:27] Pchelolo: that error is the final state after the fs issue [18:41:36] still deleting topics [18:41:48] after that I'll delete them manually on disk [18:41:50] and start the brokers [18:41:57] got a rough ETA? [18:42:44] I 'll switch over eventstreams to kafka-codfw now [18:42:51] akosiaris: thanks a lot [18:42:58] (03PS2) 10Alexandros Kosiaris: Switch eventstreams to consuming only from kafka-codfw. [puppet] - 10https://gerrit.wikimedia.org/r/445229 (https://phabricator.wikimedia.org/T199353) (owner: 10Ppchelko) [18:42:59] mark: I hope something like 30/40 mins [18:43:09] (03CR) 10Alexandros Kosiaris: [C: 032] Switch eventstreams to consuming only from kafka-codfw. [puppet] - 10https://gerrit.wikimedia.org/r/445229 (https://phabricator.wikimedia.org/T199353) (owner: 10Ppchelko) [18:43:14] there are 2k topics to clean [18:43:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch eventstreams to consuming only from kafka-codfw. [puppet] - 10https://gerrit.wikimedia.org/r/445229 (https://phabricator.wikimedia.org/T199353) (owner: 10Ppchelko) [18:43:43] <_joe_> hey I am back [18:44:03] !log ok, change merged, running puppet on scb hosts [18:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:19] <_joe_> so mark, alex we are producing events to eventbus in codfw, so mediawiki -> eventbus codfw -> kafka codfw -> eventstreams [18:44:23] akosiaris: noooo [18:44:34] puppet's disabled in eqiad for a reason.. [18:44:36] hm... [18:44:45] hmm [18:44:46] PROBLEM - Varnishkafka Statsv Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [18:44:47] <_joe_> Pchelolo: it's ok to run puppet on codfw [18:45:00] fwiw I did not enable puppet anywhere yet [18:45:04] so no harm done yet [18:45:17] yeye, just stopping Alex from running in eqiad :) [18:45:30] it's also disabled on codfw btw [18:45:39] Reason: 'cpjobqueue down' [18:45:49] <_joe_> akosiaris: I think we can enable it everywhere, and run it everywhere, then disable cpjobqueue in eqiad again [18:45:50] just to be clear, it's fine to enable puppet on codfw scb boxes, right ? 
[18:45:56] <_joe_> akosiaris: yes [18:46:14] <_joe_> akosiaris: in eqiad as well, really [18:46:37] elukey: kafka's down on 1001 [18:46:49] your script must be doing nothing [18:47:05] so hypothetically, if we have to wipe the entire eqiad kafka cluster [18:47:07] what would we lose? [18:47:07] I see /bin/bash /usr/local/bin/kafka topics --delete --topic eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.cpjobqueue.retry.mediawiki.job.cirrusSearchJobChecker [18:47:12] events from before we switched to codfw? [18:47:23] akosiaris: I am running it [18:47:24] <_joe_> mark: some events from that time, yes [18:47:27] elukey: ok [18:47:36] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [18:47:45] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [18:47:46] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [18:47:46] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [18:47:55] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [18:47:59] retention is 7 days for us IIRC, even in main-eqiad [18:48:06] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [18:48:09] ! log changeprop and eventstreams started on scb boxes after merge of https://gerrit.wikimedia.org/r/445229 T199353 [18:48:10] T199353: kafka eqiad cluster keeps crashing - https://phabricator.wikimedia.org/T199353 [18:48:21] <_joe_> Pchelolo: changeprop is now on in codfw [18:48:41] _joe_: yup went looking into it [18:48:46] so, fill me in, why is kafka codfw fine ? [18:48:54] the weird topics never made it to it ? [18:48:59] <_joe_> akosiaris: because of that, yes [18:49:29] and it was luck I think, we mirror topics between dcs.. maybe those ones were blacklisted by our mirror maker topic blacklist [18:49:35] if so, pure luck [18:49:35] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 56.46, 37.20, 29.47 [18:49:39] great :) [18:49:44] fate sharing [18:49:46] <_joe_> akosiaris: so in eqiad, I'll fix eventstreams now [18:50:06] I was about to ask, should I run puppet in eqiad scb boxes too ? [18:50:21] elukey: are you sure the script is doing something? kafka itself seems to be down.. [18:50:49] <_joe_> akosiaris: yes, then we need to disable it again and stop changeprop and cpjobqueue [18:50:50] ther process is definitely running since 18:50 on kafka1001 [18:51:06] <_joe_> so lemme do that with some cumin-fu [18:51:23] yup, but is it actually acheiving anything? [18:51:32] _joe_: what's the point ? I can just update eventstreams config [18:51:36] and just restart that [18:51:46] sounds faster and less error prone tbh [18:52:35] Pchelolo: so aiui it needs to be removed from zookeeper, right? [18:52:45] ...can we see what's happening in zookeeper? :) [18:52:46] Pchelolo: if I got it correctly, it is scheduling them for deletion, so cleaning up zookeeper and then allowing kafka to boostrap correctly.. 
what I am planning to do is to manually remove those from /srv/kafka/data [18:52:57] mark: I was about to check [18:53:43] _joe_: I am restarting eventstreams with a renewed config [18:53:52] I am leaving changeprop and cpjobqueue as is [18:54:08] <_joe_> akosiaris: oh I was about to run puppet everywhere on those machines [18:55:01] <_joe_> ok cool [18:55:09] I have some errors in the logs [18:55:14] js stacktraces [18:55:16] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:55:25] <_joe_> that's my fault ^^ [18:55:32] <_joe_> I stopped my command a tad too late? [18:55:57] Pchelolo: ok so the topics are under /kafka/main-eqiad/brokers/topics in zk [18:56:30] akosiaris: which node gives the errors? [18:57:09] Pchelolo: https://pastebin.com/jk3TtuaS [18:57:15] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:22] and it's scb1001, scb1003, scb1004 [18:57:27] scb1002 is fine up to now [18:57:43] it was a one off btw [18:57:55] I only noticed it once, but on all 3 hosts [18:58:36] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:59:30] akosiaris: as I see the config for eventstreams in eqiad is still pointing to kafka in eqiad, so pretty clear it will fail [18:59:33] ah lovely, zk puts stuff under /kafka/main-eqiad/admin/delete_topics [18:59:42] <_joe_> I am still getting failures, yes [18:59:52] <_joe_> akosiaris: uhm where did you change the config? [18:59:58] I checked with eqiad.change-prop.retry.change-prop.retry.ChangeNotification, it is in there [19:00:02] <_joe_> it's not as simple as you'd like to think [19:00:02] so yes it seems working [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T1900) [19:00:04] dammit [19:00:07] but it is a bit slow [19:00:13] Pchelolo: yes my fault, sorry [19:00:20] I 've updated config-vars.yaml, not config.yaml [19:00:22] sigh [19:00:23] elukey: what are these paths? http paths? file system paths? [19:00:25] fixing [19:00:26] <_joe_> akosiaris: ahah yeah [19:00:45] mark: file system paths [19:00:51] <_joe_> mark: /kafka/main-eqiad/admin/delete_topics is a zookeeper path [19:00:52] ok [19:01:03] <_joe_> the rest are filesystem paths :) [19:01:08] yes sorry :) [19:01:15] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:20] we are almost half way through [19:01:35] so if zk puts those topics on the filesystem, why does it not break for max path length as well? :) [19:01:37] <_joe_> elukey: halfway through what? [19:01:48] _joe_: elukey is deleting all the problematic topics [19:01:56] <_joe_> oh ok sorry [19:02:01] <_joe_> directly from zk? [19:02:07] yeah those are 2k, one at the time it takes ages.. [19:02:07] i believe via kafka cli [19:02:10] nope via kafka cli [19:02:25] (brb) [19:02:39] what generated them btw ? I guess some bug in changeprop ? 
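Judging from the per-topic invocation visible a few lines up, the cleanup script is just the kafka wrapper's topics subcommand run once per runaway name; a loop over them might look roughly like this (the --list flag and the grep pattern are assumptions layered on top of the --delete call the log actually shows):

    # Hypothetical: schedule every runaway retry topic for deletion, one at a time.
    kafka topics --list \
      | grep 'change-prop\.retry\.change-prop\.retry' \
      | while read -r topic; do
          kafka topics --delete --topic "$topic"
        done

Note that deletion is asynchronous: each topic is only marked (the znodes piling up under /kafka/main-eqiad/admin/delete_topics), and a broker still has to complete the on-disk rename afterwards, which is exactly the step that keeps failing.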
[19:02:56] <_joe_> akosiaris: yes, Pchelolo says he got what happened [19:03:03] ok, good to know [19:03:14] _joe_: ye, I know what happened and I've stopped what made it happen [19:03:21] so it will not happen again [19:03:32] <_joe_> Pchelolo: so we can just restart changeprop in codfw maybe [19:03:52] _joe_: puppet restarted it already [19:03:56] I can see it working [19:03:59] <_joe_> oh right [19:04:03] <_joe_> eheh [19:04:07] <_joe_> sorry, I'm too tired [19:04:17] have a coffee ;) [19:04:55] eqiad eventstreams seems to be fine [19:04:56] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 41.46, 35.29, 31.92 [19:05:03] at least that's what my curl foo says [19:05:06] <_joe_> mark: yeah, I'm supposed to have the techcom meeting tonight too :D [19:05:18] <_joe_> yeah it is [19:05:20] akosiaris: confirmed, eventstreams are working [19:05:34] <_joe_> akosiaris: also I just realized we have no real monitoring of that thing [19:05:51] _joe_: yup, very minimal as it seems [19:06:05] <_joe_> Urbanecm: stream.wikimedia.org is back, more or less [19:06:07] <_joe_> sorry [19:06:21] Thank you _joe_! [19:06:26] _joe_: therere's https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1 but it's all we have I guess [19:06:29] <_joe_> as you might've gathered, it's being a tough day :P [19:06:44] <_joe_> Pchelolo: heh [19:07:22] eventbus, eventlogging eventstreams [19:07:28] E_TOOMANYEVENTS [19:07:43] 1466 topics deleted [19:07:49] out of ? [19:07:54] 2086 [19:07:58] ok [19:08:01] thanks for the update [19:09:13] so kafka topics --describe gives [19:09:31] Topic:eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.mediawiki.job.cirrusSearchJobChecker PartitionCount:1 ReplicationFactor:3 Configs:MarkedForDeletion:true [19:09:38] MarkedForDeletion looks promising [19:11:13] hopefully it'll work [19:11:35] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:12:11] elukey: so are you planning to delete the data files as well? [19:12:15] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 48.16, 32.72, 25.58 [19:12:22] <_joe_> burrow has died [19:12:31] <_joe_> in the meantime, the appservers are burning [19:12:40] what is burrow [19:12:41] <_joe_> I'll look into them [19:12:43] (03CR) 10BryanDavis: [C: 031] "I don't have +2 here so someone else will need to roll this out. The app should deal with the change gracefully. The config is read for ea" [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [19:12:45] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.60, 31.83, 26.41 [19:13:05] <_joe_> burrow is the service that gives information and monitoring of kafka [19:13:05] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 53.14, 32.43, 24.68 [19:13:11] <_joe_> oh shit [19:13:15] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 48.01, 30.93, 25.94 [19:13:46] really? The appservers now? [19:14:00] <_joe_> yes [19:14:04] <_joe_> I'm looking into them [19:15:18] elukey: ok, so what's next? let's drop the files or attempt to start kafka on 1001? [19:15:59] Pchelolo: your thinking being, kafka might drop them itself? [19:16:05] and if it doesn't work we can still rm -rf? [19:16:14] mark: ye... 
that's the hope [19:16:18] agreed [19:16:23] <_joe_> !log restarting a few hhvm appservers with high load [19:16:24] there's also a completely brutal way https://stackoverflow.com/a/21882027/1924141 [19:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:47] Pchelolo: still not finished, gimme a bit and I should be done (less than 200 topics remaining) [19:16:48] what happened with the fr-tech full inbox ? [19:16:49] <_joe_> Pchelolo: yeah I thought we were going that way [19:16:51] I just noticed [19:17:09] ah Error validating Amazon message. Firewall problem? [19:17:17] Pchelolo, _joe_ I'd attempt a first round to leave kafka delete files itself [19:17:22] it if doesn't work, brutal rm [19:17:26] kk. [19:17:30] ok I saw some update in some task for that, looks fully unrelated [19:17:32] let's go! [19:17:39] akosiaris: yeah, we stopped the mail at its source [19:17:59] ejegg: ah, good to know. thanks! [19:18:15] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 37.75, 35.77, 30.49 [19:18:17] our payment notification listener was having trouble making requests to amazon, and was sending a mail for each payment notification [19:18:33] _joe_: any idea what's up yet? [19:18:55] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 39.78, 35.22, 30.35 [19:19:19] seems all api appservers are trending up on cpu graphs... [19:19:22] each one is a different process (it's a little web server that just accepts requests from payment processors), so it's a bit difficult to roll up multiple failures in a single email [19:19:53] <_joe_> mark: it looks like it's high load from traffic [19:19:54] <_joe_> tbh [19:20:13] <_joe_> https://grafana.wikimedia.org/dashboard/db/apache-hhvm?panelId=27&fullscreen&orgId=1&from=now-3h&to=now [19:20:21] _joe_: wondering if this could somehow be related to missing events from eqiad kafka? [19:20:30] so elukey are you starting it? [19:20:32] _joe_: : if they aren't suffering from timeouts trying to talk to kafka I am happy with it [19:20:53] <_joe_> akosiaris: they talk to eventbus in codfw [19:21:01] hmmm [19:21:08] Pchelolo: still finishing, gimme a min :D [19:21:26] done! 2k topics deleted [19:21:33] let's start kafka1001 [19:21:36] cool [19:21:39] <_joe_> let's kill changeprop please [19:21:55] <_joe_> akosiaris: can you ^^ [19:22:06] sure [19:22:29] <_joe_> that's what's killing us most probably [19:22:42] <_joe_> Pchelolo: we need to turn down concurrency on changeprop I guess? [19:22:51] !log stop changeprop on all scb hosts [19:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:02] <_joe_> 18:48 < akosiaris> ! log changeprop and eventstreams started on scb boxes after merge of https://gerrit.wikimedia.org/r/445229 T199353 [19:23:02] T199353: kafka eqiad cluster keeps crashing - https://phabricator.wikimedia.org/T199353 [19:23:05] <_joe_> timing fits [19:23:06] _joe_: the rate was not higher then usual: https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=8&fullscreen&orgId=1 [19:23:33] <_joe_> load is down [19:23:35] [2018-07-11 19:23:27,376] ERROR Error while accepting connection (kafka.network.Acceptor) [19:23:38] java.lang.ArithmeticException: / by zero [19:23:45] great! [19:23:47] :( [19:23:52] <_joe_> that was always there [19:23:57] ? 
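The "cumin-fu" alluded to earlier for this kind of fleet-wide stop might look roughly like the following; the host selector and the unit name are assumptions, since the log only records that changeprop was stopped on all scb hosts:

    # Hypothetical: stop the changeprop unit on every scb host via cumin.
    sudo cumin 'scb*' 'systemctl stop changeprop'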
[19:24:03] amusing [19:24:04] <_joe_> that error [19:24:07] seems going forward [19:24:26] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 61.85, 32.75, 24.60 [19:24:27] wait, we have a service that logs java.lang.ArithmeticException: / by zero and then just proceeds fine ? [19:24:34] seems so yes [19:24:42] * akosiaris facepalm [19:24:46] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 66.93, 40.06, 31.28 [19:24:56] ok kafka1001 looks fine afaics [19:24:59] load is down says joe ;) [19:25:01] at severity ERROR of all things [19:25:15] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:25:16] (03PS3) 10Rush: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [19:25:16] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:25:20] (03PS1) 10Andrew Bogott: Revert "prometheus: tools: scrape paws metrics into prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/445251 [19:25:25] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 48.91, 29.96, 24.03 [19:25:25] elukey: and it's logging that it's deleting index [19:25:25] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 58.02, 36.28, 29.99 [19:25:25] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:25:26] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.51, 37.18, 32.84 [19:25:26] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:25:31] (03PS2) 10Andrew Bogott: Revert "prometheus: tools: scrape paws metrics into prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/445251 [19:25:35] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 62.38, 40.37, 33.75 [19:25:35] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 37.36, 31.66, 24.83 [19:25:44] <_joe_> uhm [19:25:45] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:26:01] well at least croatia just scored [19:26:15] (03CR) 10Andrew Bogott: [C: 032] Revert "prometheus: tools: scrape paws metrics into prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/445251 (owner: 10Andrew Bogott) [19:26:16] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 48.30, 30.61, 22.64 [19:26:26] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 60.02, 39.52, 29.00 [19:26:26] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 59.10, 35.34, 24.60 [19:26:35] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 63.21, 38.25, 28.87 [19:26:41] started kafka1002 [19:27:05] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 63.64, 39.81, 29.76 [19:27:06] elukey: nah... 
1001 just died again [19:27:11] nope [2018-07-11 19:26:50,646] ERROR Shutdown broker because all log dirs in /srv/kafka/data have failed (kafka.log.LogManager) [19:27:13] let's do rm -rf [19:27:26] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 6.92, 19.27, 22.86 [19:27:30] :( [19:27:48] yeah [19:27:50] RECOVERY - Kafka Broker Server on kafka1002 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [19:27:51] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 65.48, 36.34, 23.52 [19:28:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:28:49] Pchelolo: let's remove the stuff with change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.changeprop [19:28:53] *change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.changeprop* basically [19:28:55] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:28:56] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:28:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:29:15] <_joe_> we're having an outage on the apis [19:29:15] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:29:18] <_joe_> and I have no idea why [19:29:25] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 55.21, 44.04, 32.73 [19:29:35] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:29:36] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 44.08, 35.22, 24.04 [19:29:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:30:05] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 53.30, 38.87, 26.05 [19:30:06] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 54.87, 44.02, 31.81 [19:30:13] elukey: just drop all the *change-prop.retry.change-prop.retry* [19:30:20] I'm back from dinner if I can be of any help [19:30:22] doesn't matter [19:30:35] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:30:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:30:36] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:30:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:30:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:30:56] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 37.68, 35.34, 26.66 [19:30:56] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 58.69, 33.06, 27.05 [19:31:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:31:12] <_joe_> volans: try to understand what's happening on the api cluster [19:31:13] <_joe_> I guess [19:31:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [19:31:30] _joe_: ack having a look [19:31:35] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [19:31:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:31:59] !log cleaned up *change-prop.retry.change-prop.retry* in /srv/kafka/data on kafka100[1-3] [19:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:16] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:32:29] <_joe_> as far as I can tell, it looks like genuine load [19:32:46] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 49.14, 44.12, 34.81 [19:33:09] elukey: let's start it up? 
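Per the !log entry just above, the on-disk part of the cleanup is essentially the following, run on each of kafka100[1-3]; stopping the broker first is an assumption about ordering, and the glob is the one from the log:

    # Hypothetical per-broker cleanup of the runaway partition directories.
    sudo systemctl stop kafka
    sudo rm -rf /srv/kafka/data/*change-prop.retry.change-prop.retry*

with the broker started again afterwards, as happens a few lines below.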
[19:33:26] I see [19:33:28] Jul 11 19:33:21 mw1281 apache2[28702]: [proxy_fcgi:error] [pid 28702:tid 140672669517568] [client 10.64.48.29:42184] AH01079: failed to make connection to backend: 127.0.0.1 [19:33:30] for instance there [19:33:40] <_joe_> chasemp: that means hhvm is too loaded to respond [19:33:43] and puppet is diabled by you _joe_ but I'm unsure if intended [19:33:43] ack [19:33:55] so death spiral [19:34:05] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy [19:34:05] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy [19:34:15] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)25 le 38.01 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:34:21] RECOVERY - Kafka Broker Server on kafka1003 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [19:34:55] Pchelolo: I did it, 1002 died again for the fs name issue [19:34:56] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)25 le 173.4 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:35:12] 1001 is doing recovering segment and rebuilding index files... (kafka.log.Log) [19:35:15] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:35:25] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 14.76, 24.26, 23.25 [19:35:26] I see some requests flowing now [19:35:26] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1003 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [19:35:31] what changed? [19:35:45] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:35:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:35:48] Pchelolo: ah wait the rm -rf didn't work as expected [19:35:50] re-doing it [19:35:51] 2018-07-11 19:35:42 10.64.32.107 10.2.2.22 > GET en.wikipedia.org /w/api.php?format=json&rvprop=content&prop=revisions&rvparse=&titles=Phoenix%2C+Arizona&rvlimit=1&action=query [19:35:55] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 23.51, 30.57, 29.63 [19:36:05] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:36:16] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:36:16] was this a retry death spiral bc something soemthing changeprop was down? [19:36:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:36:25] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:36:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:36:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:36:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:37:06] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1002 [19:37:10] (03PS2) 10Andrew Bogott: deployment-prep: Update wikimail_smarthost [puppet] - 10https://gerrit.wikimedia.org/r/436431 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [19:37:16] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [19:37:21] i saw the behavior change right at: [19:37:24] 2018-07-11 19:35:02 10.2.2.22 10.64.0.102 < - - - HTTP/1.1 503 Service Unavailable [19:37:24] 2018-07-11 19:35:02 10.64.32.107 10.2.2.22 > GET sv.wikipedia.org /w/api.php?format=json&action=parse&page=Filmåret_1998&prop=text|langlinks|categories|links|templates|images|externallinks|sections|properties HTTP/1.1 - - [19:37:31] suddenly the world was recovering [19:37:45] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 18.29, 22.79, 23.71 [19:37:55] _joe_: elukey volans ^ I'm assuming somethign happend at 19:35:02 [19:38:01] so there's been a tremendous spike in RecordLint jobs and wikibase-addusagesPerPage [19:38:01] if things keep trending down [19:38:28] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Update wikimail_smarthost [puppet] - 10https://gerrit.wikimedia.org/r/436431 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [19:39:06] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 18.39, 22.49, 23.83 [19:39:45] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 23.34, 27.01, 29.68 [19:40:16] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 19.38, 22.89, 29.56 [19:40:36] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:41:06] elukey: damn it [19:41:22] it took all the topics back from somewhere and recreated them [19:41:34] from another broker? [19:41:48] elukey: did you delete them on all 3 brokers? [19:42:01] who's publishing those topics? 
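The AH01079 / proxy_fcgi errors quoted a little earlier are Apache reporting that it cannot reach the local HHVM FastCGI backend, which is the signature of the "death spiral" being described: HHVM too loaded to accept new connections rather than any network fault. Counting those errors per minute is a quick way to confirm whether a given appserver is spiralling. A minimal sketch, assuming the syslog-style line format shown in the quoted error; the default log path is a guess and should be replaced with the real one:

```python
#!/usr/bin/env python3
"""Count proxy_fcgi AH01079 failures per minute in an Apache error log.

Assumes syslog-style lines like the one quoted in the log above:
  Jul 11 19:33:21 mw1281 apache2[28702]: [proxy_fcgi:error] ... AH01079: failed to make connection to backend: 127.0.0.1
The default log path is a guess; pass the real one as the first argument."""
import re
import sys
from collections import Counter

LINE_RE = re.compile(r'^(\w{3} +\d+ \d{2}:\d{2}):\d{2} .*AH01079')

def count_per_minute(path):
    counts = Counter()
    with open(path, errors='replace') as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == '__main__':
    log = sys.argv[1] if len(sys.argv) > 1 else '/var/log/apache2/error.log'
    for minute, n in sorted(count_per_minute(log).items()):
        print(f'{minute}  {n:>5} backend connection failures')
```

A count that keeps climbing alongside the load-average alerts above points at the backend being saturated, not at Apache or the network.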
[19:42:25] RECOVERY - statsv process on webperf1001 is OK: PROCS OK: 3 processes with command name python, args statsv [19:42:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:42:26] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:42:26] volans: they were created by mistake, now we can't get rid of them [19:42:36] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational [19:42:36] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [19:42:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:42:50] Pchelolo: yeah it did [19:43:03] elukey: what did? [19:43:06] Pchelolo: yeah, I was wondering if they are still getting created somehow [19:43:50] mark: I think that kafka has a list of topics (including the long ones) and a list of topics to delete. It seems trying to re-create the first list and then apply the second [19:43:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [19:44:09] Pchelolo: the other option is to remove the topics from zookeeper itself [19:44:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:44:40] elukey: I believe it's the only option left [19:44:45] elukey@kafka1002:/srv/kafka/data$ ls -l eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.changeprop.retry.resource_change-0 [19:44:48] total 0 [19:44:49] created just now [19:44:52] -rw-r--r-- 1 kafka kafka 10485760 Jul 11 19:39 00000000000000000000.index [19:44:57] things are trending down-ish but not normal still https://grafana.wikimedia.org/dashboard/db/apache-hhvm?panelId=27&fullscreen&orgId=1&from=now-3h&to=now [19:45:08] !log restarting wdqs-updater on wdqs1010 (still not using Kafka) [19:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:48] !log note: updater was not and will not be using kafka [19:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:46] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 20.77, 20.94, 23.94 [19:46:54] Pchelolo: we need to delete on zk [19:47:05] RECOVERY - Varnishkafka Statsv Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold 
[1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [19:47:07] up. [19:47:38] ok so [19:47:42] 1) all brokers down [19:47:49] 2) re-clean up /srv/kafka/data [19:48:02] (03PS1) 10EBernhardson: [WIP] Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [19:48:16] 3) remove the topics from what specified in https://github.com/darrenfu/bigdata/issues/6 [19:48:19] start again [19:48:36] elukey: how is this different from before? [19:48:48] elukey: agreed [19:48:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) (owner: 10EBernhardson) [19:48:56] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 40.13, 31.14, 32.09 [19:49:06] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 41.26, 29.97, 33.08 [19:49:16] mark: before we were trying to be gentle and rely on kafka to delete them properly. now we just force delete them from ZK [19:49:24] k [19:49:55] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 65.43, 49.26, 41.43 [19:49:56] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 57.41, 34.37, 29.86 [19:50:59] checking on zk [19:51:15] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 54.99, 35.70, 31.91 [19:51:19] load seems to be increasing across API servers [19:51:55] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 54.99, 44.72, 40.90 [19:52:25] PROBLEM - statsv process on webperf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [19:52:36] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
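Step 2 of the plan above is re-cleaning /srv/kafka/data, where an earlier rm -rf "didn't work as expected" and Kafka recreated directories such as the nested eqiad.change-prop.retry... topic within minutes. A hedged sketch of that cleanup, assuming one directory per topic-partition under the data dir and treating any name that repeats the change-prop.retry fragment as a runaway topic (both are assumptions from the log, not a confirmed layout); it dry-runs unless --delete is passed, and it only makes sense with the broker stopped (step 1), otherwise Kafka simply recreates the directories again:

```python
#!/usr/bin/env python3
"""Find (and optionally remove) runaway nested change-prop.retry topic
directories under a Kafka data dir. Sketch only: the data dir path and the
"nested retry" heuristic are assumptions based on the log above."""
import argparse
import shutil
from pathlib import Path

def runaway_dirs(data_dir, fragment='change-prop.retry', min_repeats=2):
    for entry in sorted(Path(data_dir).iterdir()):
        if entry.is_dir() and entry.name.count(fragment) >= min_repeats:
            yield entry

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('data_dir', nargs='?', default='/srv/kafka/data')
    ap.add_argument('--delete', action='store_true',
                    help='actually remove the directories (default: dry run)')
    args = ap.parse_args()
    for d in runaway_dirs(args.data_dir):
        print(('removing ' if args.delete else 'would remove ') + str(d))
        if args.delete:
            shutil.rmtree(d)

if __name__ == '__main__':
    main()
```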
[19:53:11] it might be me stopping kafka servers [19:53:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:53:16] this could be the pattern [19:53:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:53:20] but not sure why [19:53:51] Pchelolo: need to come up with a script to delete those topics, it is non trivial [19:53:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:53:56] going to do it now [19:54:45] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:55:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:55:35] elukey: doesn't zkCli.sh support wildcards?? [19:55:45] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:55:55] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:56:13] oh no it does not [19:56:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:56:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:58:25] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 25.38, 27.60, 29.83 [19:59:40] load seems to be reducing again where I'm spot checking so I'm wondering if elukey returned something to working order (brokers?) [20:00:05] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 23.61, 26.18, 29.98 [20:00:07] cscott, arlolra, subbu, bearND, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T2000). 
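Since zkCli.sh has no wildcard support, the "non trivial" script mentioned above has to enumerate the topic znodes itself and delete the matching ones one at a time. A minimal sketch of that idea using the kazoo client; the ZooKeeper address is a placeholder (not a confirmed host), the three base paths are the ones listed further down in the log, and the same nested change-prop.retry heuristic is used to pick out the runaway topics:

```python
#!/usr/bin/env python3
"""Delete runaway Kafka topic znodes one by one. Sketch only: the ZooKeeper
hosts, paths and topic-matching heuristic are assumptions from the log."""
from kazoo.client import KazooClient

ZK_HOSTS = 'conf1001.eqiad.wmnet:2181'  # placeholder, not a confirmed host
BASE_PATHS = [
    '/kafka/main-eqiad/brokers/topics',
    '/kafka/main-eqiad/admin/delete_topics',
    '/kafka/main-eqiad/config/topics',
]

def is_runaway(topic, fragment='change-prop.retry', min_repeats=2):
    return topic.count(fragment) >= min_repeats

def main(dry_run=True):
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    try:
        for base in BASE_PATHS:
            for topic in zk.get_children(base):
                if not is_runaway(topic):
                    continue
                path = f'{base}/{topic}'
                print(('would delete ' if dry_run else 'deleting ') + path)
                if not dry_run:
                    # recursive=True also removes partition state znodes underneath
                    zk.delete(path, recursive=True)
    finally:
        zk.stop()

if __name__ == '__main__':
    main(dry_run=True)
```

Deleting znode by znode is also why the progress reports later in the log (580/2086, then 1252/2086) crawl along: there is no bulk operation to lean on.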
[20:00:30] chasemp: all the kafka traffic is going through codfw now, so eqiad brokers should not affect anything [20:01:05] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 18.16, 19.40, 23.96 [20:01:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:01:56] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 50.31, 41.06, 40.13 [20:02:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:02:23] twentyafterfour: thank you :) [20:02:25] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 34.94, 28.15, 32.07 [20:02:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:03:16] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 35.15, 31.80, 32.08 [20:03:53] !log rolling restarting mediawiki API in eqiad with the highest load [20:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:06] tx Pchelolo :) seems like just happenstance then [20:06:09] ok ready for the first batch [20:06:25] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 57.49, 43.95, 41.09 [20:07:01] elukey: first batch being? [20:07:45] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 40.00, 33.20, 32.37 [20:08:05] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 0.77, 4.73, 23.65 [20:08:15] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:08:40] Pchelolo: in the end all the topics in the three paths sorry, it is running [20:08:40] i'm not sure how it would create load, but comparing api.reversed for 19:00-20:00 from xenon flame graphs shows yesterday shows the main difference in time spent is that usleep has gone from 8% on 10th to 23% on the 11th. I'm not finding it at all on the report for the 9th. but maybe its totally unrelated. this is some sort of locking that happens when setting values to the WAN object [20:08:46] cache. 
[20:08:55] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 59.62, 41.09, 33.18 [20:09:46] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 65.65, 36.03, 28.15 [20:11:35] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational [20:11:46] PROBLEM - Varnishkafka Statsv Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [20:12:03] volans: ^ this was the issue with the debmonitor users session .scope unit you were mentioning [20:12:15] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 37.83, 32.62, 32.22 [20:12:16] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 51.92, 35.61, 32.01 [20:12:22] a systemctl reset-failed fixed it for now, back to the actual problems [20:12:40] thanks akosiaris [20:12:56] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:12:57] and so far it happened only on swift backends [20:13:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:13:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:13:47] so it is taking a long time [20:13:53] I am still at ~200 topics [20:13:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:14:00] really sorry people, it must be done 1 by 1 [20:14:10] no worries [20:14:25] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 43.13, 34.20, 32.71 [20:14:26] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:14:26] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 61.63, 39.76, 33.89 [20:14:37] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 60.84, 40.34, 33.09 [20:14:47] why's mirror maker still alive somewhere? 
[20:14:50] I'll stop them [20:14:59] I think it is on kafka2* Pchelolo [20:15:05] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 59.77, 32.84, 24.49 [20:15:06] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 57.53, 38.77, 33.84 [20:15:20] ok 1003 kafka-mirror is up [20:15:25] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 56.93, 33.69, 23.85 [20:15:26] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 51.04, 36.03, 29.37 [20:15:46] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 61.77, 37.18, 27.51 [20:15:51] damn those API servers, I restart one and 3 more pops out here [20:16:33] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) [20:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:40] T195325: Page media endpoint missing math formula images - https://phabricator.wikimedia.org/T195325 [20:16:41] T189830: Improve media endpoint performance - https://phabricator.wikimedia.org/T189830 [20:16:41] T177619: Return variant URLs and titles in the metadata response - https://phabricator.wikimedia.org/T177619 [20:16:41] T196523: Page Preview should not strip space before period - https://phabricator.wikimedia.org/T196523 [20:16:56] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:16:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:03] volans: leave me mw1222 please [20:17:15] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:17:15] elukey: ack [20:17:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:52] elukey: kafka-mirror are running in eqiad and I have no rights to stop them [20:18:07] https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/How_to_debug_HHVM [20:18:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:18:45] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:18:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:18:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:18:57] Pchelolo: in eqiad there 
is no jvm runnign for kafka [20:19:06] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 19.35, 29.19, 26.37 [20:19:35] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:19:45] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:19:45] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:19:46] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 9.53, 23.08, 22.34 [20:19:48] elukey: oh.. `kafka-mirror` systemd unit is something different and weird [20:20:04] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 03m 30s) [20:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:23] !log rolled back mobileapps deploy [20:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:36] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 17.17, 24.48, 23.69 [20:20:46] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:20:46] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:20:53] sorry, didn't know about the other issues earlier. My bad. [20:21:04] Pchelolo: snap.. on which host? maybe just to reset fail it [20:21:29] elukey: on all, but they're not running, they're `active (exited)`.. I don't know what's that.. [20:21:38] I mean don't know what are those units [20:21:52] (03CR) 10Greg Grossmeier: [C: 031] Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [20:21:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:22:05] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:22:06] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 52.94, 35.80, 30.90 [20:22:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:22:26] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 56.16, 32.54, 26.54 [20:22:35] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 52.03, 29.87, 21.24 [20:22:35] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:22:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:22:46] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:22:55] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 52.44, 34.87, 27.53 [20:23:05] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 63.40, 42.82, 34.53 [20:23:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:23:25] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 60.09, 36.35, 28.77 [20:23:28] elukey: how's the script doing? [20:23:56] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 66.97, 44.82, 35.74 [20:23:57] Pchelolo: I am doing some sanity checks, it seems working so far [20:24:17] Pchelolo: I am purging three zk dirs: [20:24:17] /kafka/main-eqiad/brokers/topics [20:24:18] /kafka/main-eqiad/admin/delete_topics [20:24:18] /kafka/main-eqiad/config/topics [20:24:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:24:45] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:24:46] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:24:46] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 74.31, 46.90, 38.94 [20:24:48] purging like removing completely everything or just for our failed topics? 
[20:25:05] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:25:10] Pchelolo: no no I've selected the long topics [20:25:15] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 56.90, 45.29, 36.41 [20:25:15] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:25:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:25:16] okok [20:25:25] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:25:26] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 55.10, 35.22, 30.88 [20:25:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:25:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:25:44] Pchelolo: currently at 580/2086 [20:25:45] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 64.72, 42.70, 32.61 [20:25:55] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 49.03, 31.76, 21.50 [20:25:55] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 71.87, 41.24, 29.53 [20:25:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:25:55] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:26:15] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 18.58, 23.61, 29.10 [20:26:25] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:26:26] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.82, 33.63, 26.28 [20:26:56] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 14.64, 22.16, 29.16 [20:28:05] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 49.96, 44.17, 35.39 [20:28:45] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 50.53, 37.27, 40.86 [20:29:05] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 39.72, 36.65, 31.07 [20:29:06] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 60.10, 31.48, 31.31 
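Two systemd side-threads a little earlier, the degraded state on the swift backends cleared with a `systemctl reset-failed` and the kafka-mirror units sitting in `active (exited)`, are the kind of thing a small helper can triage in one pass: show the state of every unit matching a pattern, and optionally clear the ones that are genuinely failed. A sketch that shells out to systemctl, so it has to run on the affected host; the default pattern is just the unit family being puzzled over in the log:

```python
#!/usr/bin/env python3
"""List the state of systemd units matching a pattern and optionally clear
failed ones with `systemctl reset-failed`. Sketch only."""
import argparse
import subprocess

def unit_states(pattern):
    out = subprocess.run(
        ['systemctl', 'list-units', '--all', '--no-legend', pattern],
        capture_output=True, text=True, check=True).stdout
    for raw in out.splitlines():
        fields = raw.lstrip('●* ').split()   # failed units carry a marker prefix
        if len(fields) >= 4:
            unit, _load, active, sub = fields[:4]
            yield unit, active, sub

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('pattern', nargs='?', default='kafka-mirror*')
    ap.add_argument('--reset-failed', action='store_true',
                    help='run `systemctl reset-failed` on units in the failed state')
    args = ap.parse_args()
    for unit, active, sub in unit_states(args.pattern):
        print(f'{unit}: {active} ({sub})')
        if args.reset_failed and active == 'failed':
            subprocess.run(['systemctl', 'reset-failed', unit], check=True)
            print(f'  cleared failed state of {unit}')

if __name__ == '__main__':
    main()
```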
[20:30:36] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 58.93, 45.22, 37.38 [20:31:15] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 9.08, 25.29, 28.49 [20:32:36] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 71.79, 53.36, 44.16 [20:33:26] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 45.12, 36.81, 32.31 [20:34:16] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 74.71, 44.07, 32.92 [20:34:36] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 62.22, 42.02, 35.80 [20:35:06] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 66.32, 41.77, 32.81 [20:37:35] PROBLEM - High CPU load on API appserver on mw1342 is CRITICAL: CRITICAL - load average: 76.31, 33.80, 26.72 [20:38:06] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 52.49, 52.89, 47.21 [20:38:36] (03CR) 10Krinkle: [C: 031] Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [20:39:07] <_joe_> !log depooling mw1223 for debugging [20:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:30] _joe_: should I just stop the job queue for a little while to decrease the load? [20:39:45] RECOVERY - High CPU load on API appserver on mw1342 is OK: OK - load average: 38.00, 33.96, 27.75 [20:39:48] <_joe_> Pchelolo: the jobqueue goes to another cluster [20:39:56] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 71.88, 40.07, 36.05 [20:40:37] elukey: let me know if i can help [20:41:06] nuria_: thanks :) currently waiting for zookeeper to be clean to restart kafka again [20:41:25] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 74.69, 43.97, 35.47 [20:42:25] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 17.51, 23.59, 28.98 [20:42:35] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:55] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:15] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:35] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 1.36, 14.61, 22.56 [20:43:45] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 10.74, 15.51, 28.37 [20:43:45] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 51.52, 50.83, 48.15 [20:44:16] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 76.43, 53.52, 40.15 [20:45:15] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [20:45:45] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 76613 bytes in 0.274 second response time [20:45:49] Pchelolo: 1252/2086 [20:45:58] oh it's so slow.. [20:46:05] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.023 second response time [20:46:16] should the topic be updated for api outage? 
Users are noticing the api is not working in #wikipedia-en [20:47:12] yes [20:47:34] thanks. [20:47:35] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 59.31, 55.23, 43.44 [20:47:55] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 70.93, 38.46, 32.45 [20:48:11] (03CR) 10Framawiki: [C: 031] phabricator: Attempt to multiply rate limits for WMDE and WMF offices [puppet] - 10https://gerrit.wikimedia.org/r/444124 (https://phabricator.wikimedia.org/T198612) (owner: 10Alex Monk) [20:48:47] <_joe_> !log repooled mw1223 after investigation [20:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:05] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 23.68, 23.92, 29.75 [20:49:17] <_joe_> !log depooling mw1280 for debugging [20:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:36] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:49:36] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:49:46] RECOVERY - High CPU load on API appserver on mw1345 is OK: OK - load average: 24.44, 33.98, 35.65 [20:49:48] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259 (10Krenair) [20:49:50] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244 (10Krenair) 05Open>03Resolved [20:49:52] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006 (10Krenair) [20:50:16] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 27.50, 34.71, 35.40 [20:50:26] (03PS1) 10Krinkle: Revert "Make all non-test wikis write to both nutcracker and mcrouter again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445298 [20:50:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:50:55] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:51:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:16] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 13.98, 20.04, 23.58 [20:51:25] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:51:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:36] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:52:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:52:18] (03PS1) 10Alex Monk: deployment-prep: Remove hiera data for deleted instance [puppet] - 10https://gerrit.wikimedia.org/r/445299 [20:52:26] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 3.35, 23.23, 28.32 [20:53:05] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:05] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 16.64, 21.48, 29.54 [20:53:08] (03CR) 10Krinkle: [C: 032] Revert "Make all non-test wikis write to both nutcracker and mcrouter again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445298 (owner: 10Krinkle) [20:53:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:53:20] * Krinkle staging on deploy1001 and mwdebug1002 [20:53:36] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:37] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Remove hiera data for deleted instance [puppet] - 10https://gerrit.wikimedia.org/r/445299 (owner: 10Alex Monk) [20:53:46] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:58] (03Merged) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445298 (owner: 10Krinkle) [20:56:35] RECOVERY - High CPU load on API appserver on mw1339 is OK: OK - load average: 22.82, 29.75, 35.62 [20:56:57] (03CR) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445298 (owner: 10Krinkle) [20:57:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:57:51] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: Ifa659de6453 - Revert multi-write mcrouter for most wikis - T198239 (duration: 00m 58s) [20:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:54] T198239: Rollout use of mcrouter for MediaWiki in production - https://phabricator.wikimedia.org/T198239 [20:57:56] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:58:05] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:58:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:58:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:58:24] heh [20:59:04] that... what's my change. [20:59:08] wasn't* [20:59:10] probably not no [20:59:16] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5&from=1531339346598&to=1531342728641 [20:59:23] that decline started 10min ago [20:59:45] yes it has been flapping for a long time [20:59:49] since 19:29:16 [20:59:57] Ulsfo HTTP 5xx reqs/min on graphite1001 that is [21:00:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:01:15] Eqiad HTTP 5xx reqs/min has recovered also 2 times since 19:28:55 [21:01:25] RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 25.62, 28.18, 35.78 [21:01:45] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 19.35, 20.54, 29.47 [21:02:06] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 11.78, 13.16, 23.25 [21:02:25] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 21.61, 22.16, 29.65 [21:02:28] 10Operations, 10Cloud-Services, 10netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10chasemp) >>! In T193496#4228340, @ayounsi wrote: > https://apps.db.ripe.net/db-web-ui/#/lookup?source=ripe&key=185.15.56.0%2F24AS14907&type=route created. > ``` > 185... 
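The back-and-forth above about when the 5xx decline actually started ("since 19:29:16", "that decline started 10min ago") is easier to settle by pulling the series straight from graphite than by eyeballing the dashboard. A sketch against the graphite render API; the metric path below is a placeholder, not the real target behind the varnish-aggregate-client-status-codes dashboard:

```python
#!/usr/bin/env python3
"""Print the minutes in the last N hours where a graphite series exceeded a
threshold. The metric path is a placeholder; the render API's JSON shape
(list of {target, datapoints: [[value, ts], ...]}) is the standard one."""
import datetime
import requests

GRAPHITE = 'https://graphite.wikimedia.org/render'
TARGET = 'sumSeries(varnish.eqiad.*.frontend.request.client.status.5xx.sum)'  # placeholder

def over_threshold(target, threshold=1000, hours=3):
    resp = requests.get(GRAPHITE, params={
        'target': target, 'from': f'-{hours}h', 'format': 'json'})
    resp.raise_for_status()
    for series in resp.json():
        for value, ts in series['datapoints']:
            if value is not None and value > threshold:
                yield datetime.datetime.utcfromtimestamp(ts), value

if __name__ == '__main__':
    for when, value in over_threshold(TARGET):
        print(f'{when:%H:%M}  {value:.0f} 5xx/min')
```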
[21:03:05] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 7.657 second response time [21:03:35] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 76653 bytes in 0.131 second response time [21:03:45] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [21:03:55] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 15.95, 16.27, 23.61 [21:03:55] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 18.66, 20.58, 29.30 [21:04:15] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.45, 10.54, 23.36 [21:04:58] Pchelolo: ~200 to go [21:05:14] after those I'll do a quick sanity check in zk and then attempt another start [21:05:25] kk [21:05:36] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 13.92, 14.28, 23.35 [21:07:45] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 14.53, 14.77, 23.95 [21:10:35] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 10.62, 11.61, 23.79 [21:10:35] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 18.67, 21.57, 29.21 [21:10:51] (03PS4) 10Rush: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [21:10:55] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 9.34, 12.77, 23.71 [21:11:05] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 9.73, 11.06, 23.24 [21:11:47] <_joe_> elukey: when you think you're ready to repool eventbus, ping me; or it's ok to wait tomorrow [21:12:04] _joe_: so what happened with appservers? 
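The "quick sanity check in zk" mentioned above, before attempting another broker start, can be the mirror image of the deletion script: walk the same three paths and assert that nothing matching the runaway pattern is left. A short sketch under the same assumptions (placeholder ZooKeeper address, heuristic topic match):

```python
#!/usr/bin/env python3
"""Pre-restart sanity check: assert no runaway topic znodes remain.
Same assumptions as the deletion sketch earlier (placeholder host/paths)."""
from kazoo.client import KazooClient

BASE_PATHS = [
    '/kafka/main-eqiad/brokers/topics',
    '/kafka/main-eqiad/admin/delete_topics',
    '/kafka/main-eqiad/config/topics',
]

zk = KazooClient(hosts='conf1001.eqiad.wmnet:2181')  # placeholder, not a confirmed host
zk.start()
leftovers = [
    f'{base}/{topic}'
    for base in BASE_PATHS
    for topic in zk.get_children(base)
    if topic.count('change-prop.retry') >= 2
]
zk.stop()

if leftovers:
    print(f'NOT clean: {len(leftovers)} runaway topics remain, e.g. {leftovers[0]}')
else:
    print('zookeeper looks clean, ok to start the brokers')
```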
[21:12:15] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 13.97, 14.17, 23.95 [21:12:28] <_joe_> Pchelolo: it's a long story, but unrelated from kafka [21:12:32] (03CR) 10Rush: [C: 032] Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [21:12:35] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 19.67, 21.20, 29.27 [21:12:45] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 10.29, 11.97, 23.53 [21:12:56] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 15.36, 16.44, 23.53 [21:13:06] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 14.21, 17.07, 23.31 [21:13:12] Pchelolo, _joe_ I am trying to start again kafka, zk should be clean now [21:13:25] kk elukey watching it as well [21:14:19] <_joe_> !log repooling mw1280 [21:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [21:16:57] !log starting kafka on kafka100[1-3] after zk cleanup [21:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)25 le 68.7 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:19:56] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)25 le 76.51 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [21:21:11] elukey: how's it looking? [21:21:45] mark: for now looking like starting up [21:22:02] before it died within 10 min i think [21:24:36] mark: this time logs are clean and steady, no horrible errors [21:24:40] the logs look much better now at least [21:25:12] yeah Pchelolo, also all three brokers are partition leaders [21:25:28] ISRs are good [21:27:09] cool. next steps ? [21:27:31] I guess some more verification as step 1 ? or do you feel ok with this ? [21:28:05] so I'd need to puppetize the 2G heap increase (modified manually when disabled puppet, it makes sense to keep it since it helped) [21:28:27] elukey: I have like 10 items for postmortem here.. [21:28:39] then there is all the work to revert the codfw failover (eventbus + eventstreams).. 
I'd wait tomorrow if you guys are ok [21:28:49] <_joe_> elukey: uhm ok [21:28:59] <_joe_> I'm too tired to do something about it myself tbh [21:29:00] fine by me [21:29:03] I was wondering as well whether revert the switchiver today [21:29:08] <_joe_> and I would not be able to do a switchback [21:29:19] <_joe_> to codfw in case of need [21:29:27] let's do tomorrow [21:29:29] <_joe_> someone needs to send an email to ops@ at least [21:29:33] yeah I am super tired as well, too prone to failures [21:29:53] the only thing I would do is restart change-prop in codfw [21:29:55] _joe_: ^ [21:30:16] RECOVERY - Varnishkafka Statsv Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=statsv&var-host=All [21:30:21] cause that one is switched off in both clusters [21:30:35] I can do that [21:30:52] I guess there's no objections to that? [21:31:05] I will stay more to watch its graphs [21:31:17] <_joe_> I am worried for the mediawiki api [21:31:18] we killed it in case it was causing the mediawiki api issues, but I think it's clear it did not [21:31:25] <_joe_> akosiaris: yeah [21:31:32] <_joe_> still it was hammering it [21:31:44] yeah it was adding to the load, I 'll give you that [21:32:08] should we lower the concurrency perhaps ? [21:32:57] (03PS1) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 [21:33:00] <_joe_> I would prefer that, but your call guys [21:33:46] (03PS2) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [21:34:23] hm it does use restbase/parsoid in codfw, but api is indeed from eqiad.. [21:34:35] (03CR) 10Alex Monk: "Is this really necessary? Why not just point the entire zone at labs-ns* ?" [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) (owner: 10Andrew Bogott) [21:35:29] akosiaris: if you feel like it I can quickly decrease the concurrency, but the one that's important is only the one for transclusions [21:35:55] and it's currently 200 [21:36:32] let's lower it then? [21:36:38] ok. wanna decrease just that one then ? 
[21:36:57] akosiaris: the others are 50 and they normally don't even reach that [21:37:17] but ok, let's just decrease all and have a good night sleep [21:37:20] wait a min [21:37:46] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational [21:37:54] sleep would be good [21:38:04] <_joe_> mark: ++ [21:38:14] (03PS1) 10Elukey: role::kafka::main: raise Kafka Java Xmx/Xms [puppet] - 10https://gerrit.wikimedia.org/r/445304 [21:38:27] this is the current setting (puppet disabled on kafka100[1-3] [21:38:54] thanks a lot everyone :) [21:39:24] I'll just deploy changeprop with new concurrency to codfw [21:39:32] ok [21:39:56] I am running pcc now, after that a quick sanity check for the patch would be awesome [21:40:12] Pchelolo: basically raising Xmx/Xms to 2G (rather than only 1G) [21:40:24] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@45c3807]: Temporary decrease concurrency [21:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:42] and [21:40:46] ouch [21:40:56] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [21:41:05] and I'll start a postmortem tomorrow if you don't mind [21:41:12] postmortem for kafka issue [21:41:19] heh, I guess scap is restarting changeprop [21:41:32] akosiaris: that's what we want isn't it? [21:41:40] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@45c3807]: Temporary decrease concurrency (duration: 01m 17s) [21:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:46] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [21:41:46] yup, I was ready to do it, turns out I don't have too :-) [21:41:55] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [21:41:56] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [21:42:05] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [21:42:25] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [21:42:46] RECOVERY - statsv process on webperf1001 is OK: PROCS OK: 3 processes with command name python, args statsv [21:42:53] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11780/ - don't see any change for /etc/default/kafka though..." [puppet] - 10https://gerrit.wikimedia.org/r/445304 (owner: 10Elukey) [21:43:05] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational [21:44:23] ok it is not the right file to change [21:44:32] I am also enabling puppet on scb2* [21:44:47] so I am going to restart puppet on kafka*, it will bring back /etc/default/kafka to 1G/1G without restarting anything [21:44:51] then I'll check tomorrow :) [21:45:01] (03CR) 10Ppchelko: role::kafka::main: raise Kafka Java Xmx/Xms (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/445304 (owner: 10Elukey) [21:45:16] elukey: that $heap_opts parameter is not used anywhere, is it ? 
[21:45:24] (03PS3) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) [21:45:37] (03CR) 10jerkins-bot: [V: 04-1] Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) (owner: 10Andrew Bogott) [21:45:49] akosiaris: I thought it was used inside kafka.default.erb, but apparently it is not the right way :) [21:46:46] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [21:46:55] RECOVERY - Check systemd state on kafka1001 is OK: OK - running: The system is fully operational [21:47:03] elukey: tbh, my brain is fried... [21:47:08] akosiaris: mine too! [21:47:17] !log re-enable kafka mirror maker on kafka100[1-3] [21:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:35] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [21:47:44] !log disabled phabricator throttling [21:47:45] elukey: you can leave puppet disabled for a few hours on the kafka boxes btw [21:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:55] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 3.086e+04 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:48:26] RECOVERY - Check systemd state on kafka1002 is OK: OK - running: The system is fully operational [21:48:30] akosiaris: I wanted to re-enable mirror maker so in these hours it'll catch up [21:48:35] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 3.082e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:48:37] ah, good point [21:48:48] poor mm is lagging :D [21:48:56] ok done! [21:49:09] 3.082*10^6 ? 
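The post-restart spot checks described above (all three brokers holding partition leaderships, ISRs looking good) and the earlier under-replicated-partition alerts can be summarised from the output of the usual topic-describe tool. A sketch that parses that output on stdin; the exact invocation (something like `kafka-topics.sh --zookeeper <zk>/kafka/main-eqiad --describe`, or the local wrapper around it) depends on the Kafka version in use, so treat the tool name, flags and the exact line layout as assumptions:

```python
#!/usr/bin/env python3
"""Summarise partition leadership and ISR health from `kafka-topics --describe`
output piped to stdin. Assumes lines of the usual form:
  Topic: x  Partition: 0  Leader: 1001  Replicas: 1001,1002,1003  Isr: 1001,1002,1003"""
import re
import sys
from collections import Counter

PART_RE = re.compile(
    r'Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)')

leaders = Counter()
under_replicated = 0
leaderless = 0

for line in sys.stdin:
    m = PART_RE.search(line)
    if not m:
        continue
    leader, replicas, isr = m.group(2), m.group(3).split(','), m.group(4).split(',')
    if leader == '-1':
        leaderless += 1
    else:
        leaders[leader] += 1
    if set(isr) != set(replicas):
        under_replicated += 1

print('partitions led per broker:', dict(leaders))
print('under-replicated partitions:', under_replicated)
print('leaderless partitions:', leaderless)
```

A roughly even leader count per broker plus zero under-replicated and leaderless partitions matches the "ISRs are good" verdict in the log; mirror-maker lag is a separate concern and simply needs time to drain.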
[21:49:12] lol
[21:49:23] (03PS4) 10Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374)
[21:49:25] RECOVERY - Check systemd state on kafka1003 is OK: OK - running: The system is fully operational
[21:49:25] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties
[21:49:32] ok, changeprop in codfw is operational, but overnight we will definitely accumulate some backlog
[21:50:06] yeah, I guess it will be fine for a few hours
[21:50:30] (hopefully)
[21:50:30] <_joe_> how will people do without those prerendered graphs in graphoid
[21:50:52] I think that rsyslog on lithium is misbehaving again like Monday
[21:51:08] <_joe_> elukey: I think someone in the US can take a look
[21:51:41] yep :)
[21:51:51] ok, I am going to bed
[21:51:57] good night everyone
[21:52:46] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[21:53:05] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1199 days)
[21:53:48] !log restart rsyslog on lithium - in:imtcp stuck in EAGAIN (Resource temporarily unavailable) due to an old socket to tegmen.wikimedia.org
[21:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:54] cc: godog --^
[21:54:49] going to bed as well :)
[21:55:21] ok, thank you everyone, I'm leaving too
[21:59:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Nuria)
[22:00:11] sent an email to operations@ Pchelolo
[22:00:23] see you tomorrow people o/
[22:00:26] * elukey off!
[22:00:34] see you elukey, thanks a lot for the help
[22:00:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Please install Text::CSV_XS at stat1005 - https://phabricator.wikimedia.org/T199131 (10Nuria) 05Open>03Resolved
[22:00:51] * elukey hugs Pchelolo
[22:06:06] jouncebot: next
[22:06:07] In 0 hour(s) and 53 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T2300)
[22:08:20] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Nuria)
[22:16:10] anyone mind if I do a mobileapps deploy since I couldn't do it earlier with the outages?
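About the rsyslog restart !logged above at 21:53: a rough sketch of that kind of check and fix, assuming a systemd host with the usual ss/systemctl tools (the specific commands are illustrative and not taken from the log):

    # Look for connections to the TLS listener (imtcp on port 6514);
    # a stale peer socket can leave the input stuck retrying with EAGAIN.
    ss -tnp | grep ':6514'
    # Restart rsyslog to drop the old socket and re-open the listener.
    sudo systemctl restart rsyslog
    # Confirm the service and the listener came back.
    systemctl status rsyslog --no-pager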
[22:18:16] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523)
[22:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:22] T195325: Page media endpoint missing math formula images - https://phabricator.wikimedia.org/T195325
[22:18:23] T189830: Improve media endpoint performance - https://phabricator.wikimedia.org/T189830
[22:18:23] T177619: Return variant URLs and titles in the metadata response - https://phabricator.wikimedia.org/T177619
[22:18:23] T196523: Page Preview should not strip space before period - https://phabricator.wikimedia.org/T196523
[22:23:12] !log created wikilove_log on ckbwiki T199336
[22:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:15] T199336: Problem with WikiLove in ckb Wikipedia - https://phabricator.wikimedia.org/T199336
[22:23:19] (03PS6) 10MarcoAurelio: Create site striker.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/441817 (https://phabricator.wikimedia.org/T189637)
[22:25:00] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 06m 44s)
[22:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:06] T195325: Page media endpoint missing math formula images - https://phabricator.wikimedia.org/T195325
[22:25:06] T189830: Improve media endpoint performance - https://phabricator.wikimedia.org/T189830
[22:25:07] T177619: Return variant URLs and titles in the metadata response - https://phabricator.wikimedia.org/T177619
[22:25:07] T196523: Page Preview should not strip space before period - https://phabricator.wikimedia.org/T196523
[22:31:45] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 46 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[22:35:51] (03PS1) 10Alex Monk: openstack eqiad1: Run dns_floating_ip_updater [puppet] - 10https://gerrit.wikimedia.org/r/445310 (https://phabricator.wikimedia.org/T199374)
[22:43:55] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[22:57:29] (03PS1) 10MarcoAurelio: deployment-prep: add eswikibooks to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387)
[22:59:16] (03PS2) 10MarcoAurelio: deployment-prep: add eswikibooks to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387)
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180711T2300).
[23:00:04] davidwbarratt and Hauskatze: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:13] o/
[23:00:26] (03CR) 10Reedy: [C: 04-1] "This won't work because we've already too many in one cert" [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387) (owner: 10MarcoAurelio)
[23:00:29] (03CR) 10Alex Monk: [C: 04-1] "The list is already at 100 entries you can't add more to it." [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387) (owner: 10MarcoAurelio)
[23:00:55] here!
[23:02:17] (03CR) 10MarcoAurelio: "Can we remove the 'zero' entries now that the WP0 program is being disbanded?" [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387) (owner: 10MarcoAurelio)
[23:05:10] (03CR) 10Alex Monk: [C: 04-1] "Not while the sites are still running." [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387) (owner: 10MarcoAurelio)
[23:05:41] I can SWAT
[23:06:06] (03PS2) 10Thcipriani: Enable anonymous cookie blocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444246 (https://phabricator.wikimedia.org/T192017) (owner: 10Dbarratt)
[23:07:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444246 (https://phabricator.wikimedia.org/T192017) (owner: 10Dbarratt)
[23:08:42] (03Merged) 10jenkins-bot: Enable anonymous cookie blocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444246 (https://phabricator.wikimedia.org/T192017) (owner: 10Dbarratt)
[23:08:56] (03CR) 10jenkins-bot: Enable anonymous cookie blocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444246 (https://phabricator.wikimedia.org/T192017) (owner: 10Dbarratt)
[23:09:59] 10Operations, 10Core-Platform-Team, 10HHVM, 10User-ArielGlenn: Run all jobs on PHP7 - https://phabricator.wikimedia.org/T195392 (10CCicalese_WMF)
[23:10:07] davidwbarratt: your change is live on mwdebug1002, check please
[23:10:39] (03Abandoned) 10MarcoAurelio: deployment-prep: add eswikibooks to le_subjects [puppet] - 10https://gerrit.wikimedia.org/r/445315 (https://phabricator.wikimedia.org/T199387) (owner: 10MarcoAurelio)
[23:10:40] thcipriani testing!
[23:10:53] (03PS1) 10EBernhardson: Delete unused elasticsearch::proxy [puppet] - 10https://gerrit.wikimedia.org/r/445320
[23:15:16] thcipriani it works!
[23:15:38] davidwbarratt: great! thanks for checking. going live.
[23:16:46] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[23:17:46] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:444246|Enable anonymous cookie blocking]] T192017 (duration: 00m 58s)
[23:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:49] T192017: Enable anon cookie blocking on all Wikimedia wikis - https://phabricator.wikimedia.org/T192017
[23:17:53] ^ davidwbarratt live everywhere now
[23:18:00] WOOT!
[23:18:03] thcipriani thanks!
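For the "check please on mwdebug1002" step above: a sketch of how a staged config change can be exercised against the debug backend before it goes wide, assuming the WikimediaDebug request-header convention (the exact header value is an assumption, not quoted in the log):

    # Route a request to the staging backend instead of the normal app servers.
    curl -sI -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://en.wikipedia.org/wiki/Special:BlankPage' | head -n 20
    # Only once the change behaves as expected here is it synced to the whole
    # cluster (the "Synchronized wmf-config/InitialiseSettings.php" step above).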
[23:18:08] yw :)
[23:20:02] (03PS3) 10Thcipriani: throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040) (owner: 10MarcoAurelio)
[23:20:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040) (owner: 10MarcoAurelio)
[23:20:49] will push ^ live when it merges
[23:21:06] ok
[23:21:10] cannot be tested
[23:21:24] (03Merged) 10jenkins-bot: throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040) (owner: 10MarcoAurelio)
[23:21:37] (03CR) 10jenkins-bot: throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444443 (https://phabricator.wikimedia.org/T199040) (owner: 10MarcoAurelio)
[23:26:17] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:444443|throttle: lift limits for Kaqchikel edit-a-thon; clear expired rules]] T199040 (duration: 00m 56s)
[23:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:20] T199040: Increase MediaWiki rate limit for Kaqchikel edit-a-thon - https://phabricator.wikimedia.org/T199040
[23:26:30] ^ Hauskatze throttle rules are live, thanks for the patch!
[23:26:46] thcipriani: one more request remaining
[23:27:00] oh I see it :)
[23:27:09] I've never run it before...
[23:28:05] thcipriani: it's easy, I can guide if required
[23:29:18] thcipriani: just cd to /srv/mediawiki-staging and run 'scap update-interwiki-cache', then answer the questions, wait for the patch to be merged, answer 'yes' and scap will continue
[23:29:34] okie doke, here goes :)
[23:29:47] (03PS1) 10Thcipriani: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445322
[23:29:49] (03CR) 10Thcipriani: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445322 (owner: 10Thcipriani)
[23:31:06] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445322 (owner: 10Thcipriani)
[23:31:19] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445322 (owner: 10Thcipriani)
[23:32:23] !log thcipriani@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 39s)
[23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:35] Hauskatze: relatively painless :)
[23:32:53] thcipriani: thanks!
[23:33:01] yw :)
[23:33:23] thcipriani: feel free to amend the wikitech docs if there's something wrong on them, I tried to update them a couple of days ago based on my experience
[23:33:45] you submitted a patch to fix this script some days ago as well, so it was a good opportunity to test if it worked :D
[23:34:44] heh, yep, docs look right to me
[23:36:41] perfecto
[23:37:57] added in a bit of color
[23:46:05] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[23:53:32] awesome, looks good :)
[23:53:57] * Hauskatze goes to bed
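The interwiki-cache run described above boils down to a short sequence on the deployment host; a sketch following the commands quoted at 23:29, with the prompts paraphrased (exact wording varies by scap version):

    # On deploy1001, from the staging copy of mediawiki-config:
    cd /srv/mediawiki-staging
    scap update-interwiki-cache
    # The script regenerates wmf-config/interwiki.php and pushes the result to
    # Gerrit (here https://gerrit.wikimedia.org/r/445322), then pauses.
    # Wait for jenkins-bot to merge the change, answer 'yes', and scap continues
    # by syncing wmf-config/interwiki.php to the cluster.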