[00:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T0000).
[00:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:05] I'm the only one so I'll SWAT my own patches
[00:05:38] (PS2) Catrope: Give rcOresDamagingPref the same default as oresDamagingPref [mediawiki-config] - https://gerrit.wikimedia.org/r/396805 (https://phabricator.wikimedia.org/T182354)
[00:05:40] (CR) Catrope: [C: 2] Give rcOresDamagingPref the same default as oresDamagingPref [mediawiki-config] - https://gerrit.wikimedia.org/r/396805 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:07:00] (Merged) jenkins-bot: Give rcOresDamagingPref the same default as oresDamagingPref [mediawiki-config] - https://gerrit.wikimedia.org/r/396805 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:07:11] (CR) jenkins-bot: Give rcOresDamagingPref the same default as oresDamagingPref [mediawiki-config] - https://gerrit.wikimedia.org/r/396805 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:13:46] (PS7) Paladox: Gerrit: Switch to the mariadb connector [puppet] - https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164)
[00:13:57] (PS8) Paladox: Gerrit: Switch to the mariadb connector [puppet] - https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164)
[00:22:37] (Abandoned) Paladox: contint: Remove duplicate Class[Contint::Browsers] [puppet] - https://gerrit.wikimedia.org/r/397603 (owner: Paladox)
[00:22:44] RoanKattouw: I have a patch to SWAT, but I can deploy it myself if you're already done
[00:22:49] (PS3) Paladox: contint: Remove duplicate Class[Contint::Packages::Ruby] [puppet] - https://gerrit.wikimedia.org/r/397601
[00:22:58] legoktm: Still going, sorry, got distracted halfway
[00:23:01] (PS4) Paladox: contint: Remove duplicate Class[Contint::Packages::Ruby] [puppet] - https://gerrit.wikimedia.org/r/397601 (https://phabricator.wikimedia.org/T182642)
[00:23:13] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Fix default for rcOresDamagingPref (duration: 00m 56s)
[00:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:29] Also holy crap sync-file is suddenly VERY verbose
[00:24:09] (PS1) Legoktm: Have ExtensionDistributor treat REL1_30 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/397719
[00:24:25] heh
[00:24:34] I hadn't got that far yet
[00:24:44] :D
[00:24:52] RoanKattouw: ok, let me know when you're done
[00:26:42] Also, for your amusement: T182643
[00:26:43] T182643: cache_git_info (from e.g. scap sync-file) is way way too verbose - https://phabricator.wikimedia.org/T182643
[00:27:02] I just paused to file that, and it took more time than I thought because my clipboard isn't big enough for the scap output
[00:27:03] hah that is way too verbose
[00:27:06] I had to copy it in four parts
[00:27:20] debug flags left in?
[00:27:29] At least it didn't exhaust my terminal bugger
[00:27:30] *buffer
[00:27:39] (PS3) Catrope: Revert "Disable ORES in fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354)
[00:27:43] (CR) Catrope: [C: 2] Revert "Disable ORES in fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:29:08] (Merged) jenkins-bot: Revert "Disable ORES in fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:29:16] (Draft1) Paladox: contint: Doin't install the npm package on stretch [puppet] - https://gerrit.wikimedia.org/r/397720
[00:29:19] (CR) jenkins-bot: Revert "Disable ORES in fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/396806 (https://phabricator.wikimedia.org/T182354) (owner: Catrope)
[00:29:21] (PS2) Paladox: contint: Doin't install the npm package on stretch [puppet] - https://gerrit.wikimedia.org/r/397720
[00:39:37] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Re-enable ORES on fawiki (T182354) (duration: 00m 56s)
[00:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:47] T182354: OresDamagingPref back-compatibility is logging exceptions - https://phabricator.wikimedia.org/T182354
[00:40:31] I ran into that overly verbose log yesterday and freaked out. I thought something was broken. Turns out they pushed out a new scap version which does that.
[00:42:31] legoktm: OK I'm done
[00:42:38] ty
[00:42:46] Niharika: Yeah Tyler responded saying it was fixed in master already
[00:42:58] (PS2) Legoktm: Have ExtensionDistributor treat REL1_30 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/397719
[00:43:02] (CR) Legoktm: [C: 2] Have ExtensionDistributor treat REL1_30 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/397719 (owner: Legoktm)
[00:43:13] I was slightly less freaked out but only because I remember the old scap which did this kind of stuff in normal operation
[00:43:38] Yeah, I saw. I feel guilty for not filing a ticket for this myself.
[00:43:41] Right.
[00:44:32] (Merged) jenkins-bot: Have ExtensionDistributor treat REL1_30 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/397719 (owner: Legoktm)
[00:45:17] (PS2) Legoktm: Remove manual firejailing of Score binaries [mediawiki-config] - https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535)
[00:46:51] (CR) jenkins-bot: Have ExtensionDistributor treat REL1_30 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/397719 (owner: Legoktm)
[00:47:16] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Have ExtensionDistributor treat REL1_30 as stable (duration: 00m 56s)
[00:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:52] (CR) Legoktm: [C: 2] Remove manual firejailing of Score binaries [mediawiki-config] - https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535) (owner: Legoktm)
[00:49:37] (Merged) jenkins-bot: Remove manual firejailing of Score binaries [mediawiki-config] - https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535) (owner: Legoktm)
[00:49:49] (CR) jenkins-bot: Remove manual firejailing of Score binaries [mediawiki-config] - https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535) (owner: Legoktm)
[00:53:08] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Remove manual firejailing of Score binaries (T181535) (duration: 00m 56s)
[00:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:18] T181535: Convert Score binaries to use MediaWiki shell restrictions - https://phabricator.wikimedia.org/T181535
[00:55:10] (CR) Legoktm: [C: 1] "This can be merged now, MediaWiki no longer uses them." [puppet] - https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: Legoktm)
[00:56:23] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:57:22] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 73513 bytes in 0.352 second response time
[01:12:42] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:32] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 73387 bytes in 0.141 second response time
[01:34:00] (PS1) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[01:54:56] (PS2) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[02:27:42] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.11) (duration: 06m 21s)
[02:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 787.69 seconds
[03:28:02] (PS1) Dzahn: mariadb::tendril: move firewall to role, use profile [puppet] - https://gerrit.wikimedia.org/r/397725
[03:33:50] (PS1) Dzahn: druid: move firewall includes from site to roles [puppet] - https://gerrit.wikimedia.org/r/397726
[03:37:29] (PS1) Dzahn: prometheus: move duplicate firewall/standard include [puppet] - https://gerrit.wikimedia.org/r/397727
[03:40:50] (PS1) Dzahn: mirrors: move standard include out of site [puppet] - https://gerrit.wikimedia.org/r/397728
[03:49:23] PROBLEM - IPMI Sensor Status on db1055 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[03:52:13] (PS1) Dzahn: planet: add some missing Hiera calls for parameters [puppet] - https://gerrit.wikimedia.org/r/397729
[03:56:16] (PS1) Dzahn: aptrepo: move Hiera calls into parameters [puppet] - https://gerrit.wikimedia.org/r/397730
[03:57:13] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 247.38 seconds
[04:00:30] (PS1) Dzahn: pentest::tools: add missing system::role [puppet] - https://gerrit.wikimedia.org/r/397731
[04:01:11] (CR) jerkins-bot: [V: -1] pentest::tools: add missing system::role [puppet] - https://gerrit.wikimedia.org/r/397731 (owner: Dzahn)
[04:01:54] (PS2) Zppix: pentest::tools: add missing system::role [puppet] - https://gerrit.wikimedia.org/r/397731 (owner: Dzahn)
[04:02:03] (CR) Zppix: "Fixed whitespace" [puppet] - https://gerrit.wikimedia.org/r/397731 (owner: Dzahn)
[04:03:45] (PS1) Dzahn: icinga::nsca: drop system::role not in role class [puppet] - https://gerrit.wikimedia.org/r/397732
[04:05:11] (PS17) TerraCodes: Add loginwiki and wikidata to $wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302)
[04:05:30] (PS14) TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045)
[04:05:43] (PS28) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[04:17:10] (PS1) Dzahn: gerrit: correct variable name for list of servers [puppet] - https://gerrit.wikimedia.org/r/397733
[04:21:50] (PS1) Dzahn: planet: drop duplicate standard include [puppet] - https://gerrit.wikimedia.org/r/397734
[04:54:20] (PS3) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[05:02:10] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9286/" [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[05:12:22] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:13:12] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 73311 bytes in 0.144 second response time
[05:49:12] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2025108
[06:30:04] (PS10) MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793)
[06:33:17] (PS11) MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793)
[06:38:21] (PS12) MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793)
[06:41:56] (PS1) Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359)
[06:43:12] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:44:24] (PS2) Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359)
[06:46:06] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:46:55] Operations, DBA, Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830049 (Marostegui)
[06:47:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:47:38] (CR) jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/397739 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:48:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T178359 T182653 (duration: 00m 56s)
[06:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:10] T182653: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653
[06:49:10] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:56:18] !log stop MySQL on db1055 - T182653
[06:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:29] T182653: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653
[07:04:37] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/397740 (https://phabricator.wikimedia.org/T174569)
[07:10:35] Operations, ops-eqiad, DBA, Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830085 (Marostegui) a:Cmjohnson @Cmjohnson I have been unable to identify which of the PSU is the one failing, the idrac console isn't recording which one is it (sometimes i...
[07:10:59] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/397740 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[07:12:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/397740 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[07:12:30] (CR) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/397740 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[07:13:28] (CR) Marostegui: "The commit message was wrong, this was actually db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/397740 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[07:14:08] wow, the scap output is now super verbose :|
[07:14:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T174569 (duration: 00m 56s)
[07:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:54] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[07:15:20] !log Deploy schema change on db1081 - https://phabricator.wikimedia.org/T174569
[07:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:00] (PS1) KartikMistry: apertium-mk-bg: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-mk-bg] - https://gerrit.wikimedia.org/r/397741 (https://phabricator.wikimedia.org/T171406)
[07:18:45] (CR) jerkins-bot: [V: -1] apertium-mk-bg: Depends on cg3 (>= 1.0.0~r12254) [debs/contenttranslation/apertium-mk-bg] - https://gerrit.wikimedia.org/r/397741 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[07:28:34] (PS1) KartikMistry: apertium-mk-en: Update dependency on cg3 [debs/contenttranslation/apertium-mk-en] - https://gerrit.wikimedia.org/r/397742 (https://phabricator.wikimedia.org/T171406)
[07:29:14] (CR) jerkins-bot: [V: -1] apertium-mk-en: Update dependency on cg3 [debs/contenttranslation/apertium-mk-en] - https://gerrit.wikimedia.org/r/397742 (https://phabricator.wikimedia.org/T171406) (owner: KartikMistry)
[07:58:52] (PS1) Elukey: netboot.cfg: change partman config for notebook1002 [puppet] - https://gerrit.wikimedia.org/r/397743 (https://phabricator.wikimedia.org/T181518)
[07:59:45] (CR) Elukey: [C: 2] netboot.cfg: change partman config for notebook1002 [puppet] - https://gerrit.wikimedia.org/r/397743 (https://phabricator.wikimedia.org/T181518) (owner: Elukey)
[08:15:47] (CR) Muehlenhoff: [C: 1] "Ack, that's identical to my earlier https://gerrit.wikimedia.org/r/#/c/394927/, I'll merge your's later on" [puppet] - https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: Legoktm)
[08:21:30] (CR) Alexandros Kosiaris: "No we shouldn't. If anything we should just no longer install ruby-mysql at all in puppetmasters since starting with version 4 it is no lo" [puppet] - https://gerrit.wikimedia.org/r/391336 (owner: Paladox)
[08:22:42] (CR) Alexandros Kosiaris: [C: 2] icinga::nsca: drop system::role not in role class [puppet] - https://gerrit.wikimedia.org/r/397732 (owner: Dzahn)
[08:22:47] (PS2) Alexandros Kosiaris: icinga::nsca: drop system::role not in role class [puppet] - https://gerrit.wikimedia.org/r/397732 (owner: Dzahn)
[08:22:56] Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3830186 (MoritzMuehlenhoff)
[08:23:46] Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3830197 (MoritzMuehlenhoff)
[08:26:35] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3830208 (hashar)
[08:31:11] (CR) Alexandros Kosiaris: [V: 2 C: 2] icinga::nsca: drop system::role not in role class [puppet] - https://gerrit.wikimedia.org/r/397732 (owner: Dzahn)
[08:31:12] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:31:22] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[08:31:44] (PS1) Giuseppe Lavagetto: Remove trendingedits discovery endpoint [dns] - https://gerrit.wikimedia.org/r/397745 (https://phabricator.wikimedia.org/T180384)
[08:31:46] (PS1) Giuseppe Lavagetto: Remove all references to trendingedits [dns] - https://gerrit.wikimedia.org/r/397746 (https://phabricator.wikimedia.org/T180384)
[08:32:10] (PS1) Hashar: Add .gitreview [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/397747
[08:32:21] (CR) Giuseppe Lavagetto: [C: 1] "LGTM but needs:" [puppet] - https://gerrit.wikimedia.org/r/397571 (https://phabricator.wikimedia.org/T180384) (owner: Mobrovac)
[08:32:39] (CR) jerkins-bot: [V: -1] Add .gitreview [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/397747 (owner: Hashar)
[08:47:38] !log updated jessie installer netboot image after jessie 8.10 point release
[08:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:10] (PS1) Hashar: Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/397748
[08:49:37] (PS1) Elukey: Add mw13[29-37] to site.pp and conftool [puppet] - https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519)
[08:49:39] (CR) jerkins-bot: [V: -1] Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/397748 (owner: Hashar)
[08:50:10] (CR) jerkins-bot: [V: -1] Add mw13[29-37] to site.pp and conftool [puppet] - https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[08:51:22] ahh yes you are right jenkins
[08:51:55] sometime sometime :]
[08:51:59] :)
[08:52:30] elukey: in theory you could add a git hook locally that would run the tests automagically :D
[08:52:34] (PS2) Elukey: Add mw13[29-37] to site.pp and conftool [puppet] - https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519)
[08:52:42] (PS1) Jcrespo: mariadb: Depool db2059 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397750 (https://phabricator.wikimedia.org/T181777)
[08:52:49] hashar: I thought I had run the tests but apparently not
[08:53:09] (CR) jerkins-bot: [V: -1] Add mw13[29-37] to site.pp and conftool [puppet] - https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[08:53:21] hashar: ufff
[08:53:25] I have an alias for bundle exec rake test
[08:53:30] is it enough?
[08:53:49] bundle exec rake --jobs 1 test
[08:53:59] so that tests are run serially
[08:54:04] which makes it easier to spot the failure
[08:54:42] theoretically you could add it to: .git/hooks/commit-msg
[08:54:50] (CR) Jcrespo: [C: 2] mariadb: Depool db2059 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397750 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[08:55:27] or pre-commit ;D
[08:55:35] or in your text editor
[08:56:11] (Merged) jenkins-bot: mariadb: Depool db2059 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397750 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[08:56:41] (CR) jenkins-bot: mariadb: Depool db2059 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397750 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[08:57:00] hashar: ack!
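Editor's note: the local git hook suggested in the exchange above could look roughly like the following. This is a minimal sketch, not the setup anyone in the channel actually used; the only thing taken from the log is the command `bundle exec rake --jobs 1 test`. The function name `run_hook` and its `command` and `runner` parameters are hypothetical, added so the logic can be tried without a Ruby toolchain.

```python
import subprocess
import sys

# The command quoted in the chat; per the discussion, --jobs 1 runs the
# tests serially, which makes the failing one easier to spot.
SERIAL_TEST_COMMAND = ["bundle", "exec", "rake", "--jobs", "1", "test"]


def run_hook(command=None, runner=subprocess.call):
    """Run the test command and return the exit status the hook should use.

    `command` and `runner` are hypothetical parameters of this sketch; they
    exist only so the function can be exercised with a harmless command.
    """
    if command is None:
        command = SERIAL_TEST_COMMAND
    status = runner(command)
    if status != 0:
        # Git aborts the commit when the hook exits with a non-zero status.
        print("tests failed (exit %d); commit aborted" % status, file=sys.stderr)
        return 1
    return 0
```

To try it as a hook, the file would be saved as `.git/hooks/pre-commit` (or `commit-msg`), made executable, given a `#!/usr/bin/env python3` first line, and ended with `sys.exit(run_hook())`. Either hook can be skipped for a single commit with `git commit --no-verify`.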
[08:57:45] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2059 (duration: 00m 56s)
[08:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:08] (PS1) Jcrespo: Revert "mariadb: Depool db2059 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/397751
[08:58:20] (CR) Elukey: "The -1 is for:" [puppet] - https://gerrit.wikimedia.org/r/397749 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[09:00:13] !log stop, upgrade and reboot db2059
[09:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:07] (PS15) Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850)
[09:04:09] (PS1) Jcrespo: mariadb: Update db2059 socket location to the standard path [puppet] - https://gerrit.wikimedia.org/r/397753 (https://phabricator.wikimedia.org/T148507)
[09:04:37] (PS2) Jcrespo: mariadb: Update db2059 socket location to the standard path [puppet] - https://gerrit.wikimedia.org/r/397753 (https://phabricator.wikimedia.org/T148507)
[09:05:38] (CR) Jcrespo: [C: 2] mariadb: Update db2059 socket location to the standard path [puppet] - https://gerrit.wikimedia.org/r/397753 (https://phabricator.wikimedia.org/T148507) (owner: Jcrespo)
[09:16:57] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/397755
[09:17:00] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/397755
[09:20:28] (PS2) Alexandros Kosiaris: Remove torrus.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/351265 (https://phabricator.wikimedia.org/T87840)
[09:20:42] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/397755 (owner: Marostegui)
[09:20:44] (Abandoned) Alexandros Kosiaris: Remove torrus.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/351265 (https://phabricator.wikimedia.org/T87840) (owner: Alexandros Kosiaris)
[09:21:52] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/397755 (owner: Marostegui)
[09:22:03] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/397755 (owner: Marostegui)
[09:23:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 after InnoDB there - T178359 (duration: 00m 56s)
[09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:20] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:23:29] hashar: now it works perfectly, thanks a lot!
[09:25:18] (CR) ArielGlenn: "The wikidata cron job completed, no blockers left." [puppet] - https://gerrit.wikimedia.org/r/396928 (https://phabricator.wikimedia.org/T113467) (owner: ArielGlenn)
[09:25:25] (PS2) ArielGlenn: remove the last vestiges of the datasets user [puppet] - https://gerrit.wikimedia.org/r/396928 (https://phabricator.wikimedia.org/T113467)
[09:26:28] (CR) ArielGlenn: [C: 2] remove the last vestiges of the datasets user [puppet] - https://gerrit.wikimedia.org/r/396928 (https://phabricator.wikimedia.org/T113467) (owner: ArielGlenn)
[09:30:52] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational
[09:34:12] !log Stop replication in sync on db1034 and db1039 for data consistency check - T163190
[09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:23] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[09:38:05] !log reduce replication factor for cassandra on maps-test cluster and reset cassandra on maps-test2001 to work around limited disk space - T182583
[09:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:17] T182583: maps-test2001 is low on disk space - https://phabricator.wikimedia.org/T182583
[09:38:39] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/397756 (https://phabricator.wikimedia.org/T163190)
[09:40:13] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup]
[09:41:05] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/397756 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:42:20] (PS2) Alexandros Kosiaris: Remove torrus role, module, varnish backend and references [puppet] - https://gerrit.wikimedia.org/r/351276
[09:42:23] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/397756 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:42:40] (CR) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/397756 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:43:02] RECOVERY - Disk space on maps-test2001 is OK: DISK OK
[09:43:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 - T163190 (duration: 00m 56s)
[09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:03] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[09:44:17] !log Stop db1039 and db1086 in sync - T163190
[09:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:13] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:45:14] (CR) ArielGlenn: "wikidata cron job is complete, nothing uses this filesystem any more on the snapshots. time to unmount!" [puppet] - https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[09:45:39] (PS4) ArielGlenn: get rid of datasets1001 mount on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540)
[09:47:27] (CR) ArielGlenn: [C: 2] get rid of datasets1001 mount on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/396931 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[09:49:26] (PS6) Volans: Metric alarms: convert dashboad_link to array [puppet] - https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353)
[09:49:27] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db2059 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/397751 (owner: Jcrespo)
[09:50:49] (Merged) jenkins-bot: Revert "mariadb: Depool db2059 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/397751 (owner: Jcrespo)
[09:51:03] (CR) jenkins-bot: Revert "mariadb: Depool db2059 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/397751 (owner: Jcrespo)
[09:51:06] (CR) Volans: [C: 2] Metric alarms: convert dashboad_link to array [puppet] - https://gerrit.wikimedia.org/r/392607 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[09:51:35] (PS1) Marostegui: db1086.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/397757
[09:51:44] !log updated stretch installer netboot image after jessie 8.10 point release
[09:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:53] !log updated stretch installer netboot image after stretch 9.3 point release
[09:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:12] RECOVERY - Long running screen/tmux on restbase2004 is OK: OK: No SCREEN or tmux processes detected.
[09:54:51] \o/
[09:58:07] (PS3) Alexandros Kosiaris: Remove torrus role, module, varnish backend and references [puppet] - https://gerrit.wikimedia.org/r/351276
[10:01:41] (CR) Alexandros Kosiaris: [C: 2] Remove torrus role, module, varnish backend and references [puppet] - https://gerrit.wikimedia.org/r/351276 (owner: Alexandros Kosiaris)
[10:01:47] (CR) Filippo Giunchedi: Add kubelet operational latencies check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) (owner: Alexandros Kosiaris)
[10:03:45] (PS1) Volans: Varnish instance: fix child restarted check [puppet] - https://gerrit.wikimedia.org/r/397759 (https://phabricator.wikimedia.org/T170353)
[10:04:17] (CR) jerkins-bot: [V: -1] Varnish instance: fix child restarted check [puppet] - https://gerrit.wikimedia.org/r/397759 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[10:04:26] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2059 (duration: 00m 56s)
[10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:15] (PS2) Volans: Varnish instance: fix child restarted check [puppet] - https://gerrit.wikimedia.org/r/397759 (https://phabricator.wikimedia.org/T170353)
[10:05:32] (PS1) Jcrespo: mariadb: Depool db2066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397760 (https://phabricator.wikimedia.org/T181777)
[10:06:36] (PS3) Volans: Varnish instance: fix child restarted check [puppet] - https://gerrit.wikimedia.org/r/397759 (https://phabricator.wikimedia.org/T170353)
[10:07:50] (CR) Filippo Giunchedi: Add 3 prometheus checks for kubernetes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: Alexandros Kosiaris)
[10:07:57] (CR) Volans: [C: 2] Varnish instance: fix child restarted check [puppet] - https://gerrit.wikimedia.org/r/397759 (https://phabricator.wikimedia.org/T170353) (owner: Volans)
[10:14:50] !log reimaging mw1260 (video scaler) to stretch
[10:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:08] (PS1) ArielGlenn: fix up some style violations in dumps manifests [puppet] - https://gerrit.wikimedia.org/r/397761
[10:16:30] (CR) Volans: "Comments inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: Alexandros Kosiaris)
[10:18:48] (CR) Jcrespo: [C: 2] mariadb: Depool db2066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397760 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[10:20:07] (Merged) jenkins-bot: mariadb: Depool db2066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397760 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[10:20:19] (CR) jenkins-bot: mariadb: Depool db2066 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/397760 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo)
[10:21:23] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:22:56] (PS1) Jcrespo: Revert "mariadb: Depool db2066 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/397764
[10:23:06] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2066 (duration: 00m 56s)
[10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:31] !log gehel@tin Started deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2004
[10:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:48] !log gehel@tin Finished deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2004 (duration: 00m 18s)
[10:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:09] !log stop, upgrade and reboot db2066
[10:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:47] (PS1) Elukey: role::cache::misc: add a test Varnishkafka instance [puppet] - https://gerrit.wikimedia.org/r/397765
[10:29:35] (PS2) ArielGlenn: fix up some style violations in dumps manifests [puppet] - https://gerrit.wikimedia.org/r/397761
[10:29:39] (CR) Volans: "This affects also commands like "puppet node clean" on puppetmaster1001, so it's also blocking reimages."
[puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [10:30:12] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet: change location of environment setting from [main] to [agent] [puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [10:30:56] (03CR) 10ArielGlenn: [C: 032] fix up some style violations in dumps manifests [puppet] - 10https://gerrit.wikimedia.org/r/397761 (owner: 10ArielGlenn) [10:31:31] (03PS16) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [10:31:33] (03PS1) 10Jcrespo: mariadb: Update db2066 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/397766 (https://phabricator.wikimedia.org/T148507) [10:33:17] (03PS2) 10Jcrespo: mariadb: Update db2066 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/397766 (https://phabricator.wikimedia.org/T148507) [10:33:28] (03CR) 10Jcrespo: [C: 032] mariadb: Update db2066 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/397766 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:35:37] 10Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3830407 (10MoritzMuehlenhoff) None of the packages removed in the 9.3 update were present in our environment. [10:35:49] 10Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3830408 (10MoritzMuehlenhoff) None of the packages removed in the 8.10 update were present in our environment. 
[10:40:06] (03PS1) 10EddieGP: Restrict sending mails to new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) [10:41:13] 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#3830425 (10MoritzMuehlenhoff) >>! In T177371#3672012, @MoritzMuehlenhoff wrote: >>>! In T177371#3657276, @faidon wrote: >> We have at least another usage, the Ganeti key (cf. `modules/role/manifests/ganeti.pp`).... [10:41:46] (03CR) 10Filippo Giunchedi: "Minor things, LGTM overall" (032 comments) [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [10:44:15] PROBLEM - nutcracker port on labtestweb2001 is CRITICAL: Cannot assign requested address [10:44:47] (03CR) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [10:46:05] !log gehel@tin Started deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2003 [10:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:15] !log gehel@tin Finished deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2003 (duration: 00m 10s) [10:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:31] (03PS5) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [10:46:33] (03PS5) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [10:47:16] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] 
%27https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=1fullscreen%27 [10:50:13] (03CR) 10EddieGP: [C: 031] apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [10:53:33] ugh, the mediawiki memcacheerrors are coming from labtestwiki [10:54:04] my money is on this [10:54:05] 10:44 -icinga-wm:#wikimedia-operations- PROBLEM - nutcracker port on labtestweb2001 is CRITICAL: Cannot assign requested address [10:54:09] (03CR) 10Alexandros Kosiaris: Add kubelet operational latencies check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [10:54:37] I'll take a look at labtestweb [10:56:16] RECOVERY - nutcracker port on labtestweb2001 is OK: TCP OK - 0.000 second response time on port 11212 [10:57:01] (03PS1) 10EddieGP: apache, wwwportals: De-duplicate vhost code [puppet] - 10https://gerrit.wikimedia.org/r/397770 [10:57:22] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397771 [10:57:30] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397771 [10:58:33] (03CR) 10EddieGP: "This might be a horrible idea, maybe this was split up for a reason. And I have no clue how to test this, it may be broken, too. 
But I tho" [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [10:58:38] and parsoid talking to labtestweb's apache very often [10:59:00] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397771 (owner: 10Marostegui) [11:00:19] "I'll open a task" [11:00:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] %27https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1panelId=1fullscreen%27 [11:00:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397771 (owner: 10Marostegui) [11:00:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397771 (owner: 10Marostegui) [11:01:08] (03PS4) 10Ema: mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) [11:01:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 - T174569 (duration: 00m 56s) [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:02] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [11:02:39] (03CR) 10Ema: mtail: port varnishxcps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:03:19] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397764 (owner: 10Jcrespo) [11:03:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397772 (https://phabricator.wikimedia.org/T174569) [11:04:39] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397764 (owner: 10Jcrespo) 
[11:05:01] (03PS1) 10ArielGlenn: snapshots only have one nfs filesystem mounted, remove cruft [puppet] - 10https://gerrit.wikimedia.org/r/397773 [11:05:29] <_joe_> eddiegp: I don't think it's a horrible idea, fwiw :P [11:05:34] (03PS2) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397772 (https://phabricator.wikimedia.org/T174569) [11:05:59] !log Deploy schema change on db1056 (s4) already depooled - T174569 [11:06:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] planet: add some missing Hiera calls for parameters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397729 (owner: 10Dzahn) [11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:31] https://phabricator.wikimedia.org/T182663 done [11:07:21] (03CR) 10Filippo Giunchedi: [C: 031] mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:07:33] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2066 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397764 (owner: 10Jcrespo) [11:07:45] jynus: ok if I merge db-eqiad.php? [11:08:25] yes [11:08:34] ok thanks! 
[11:08:35] sorry, I was waiting for jenkins [11:08:41] and got distracted [11:08:47] no worries, I didn't want to mess up with your deployments :) [11:08:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397772 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:09:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] aptrepo: move Hiera calls into parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397730 (owner: 10Dzahn) [11:09:32] _joe_: Well that makes me confident I've not done complete bs ;) [11:10:15] cc andrewbogott re: https://phabricator.wikimedia.org/T182663 [11:10:15] <_joe_> eddiegp: I'll work on taking the other patch to prod today, btw [11:10:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397772 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:10:44] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397772 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [11:11:00] (03CR) 10Alexandros Kosiaris: "OSM has been done. Dashboard is at https://grafana.wikimedia.org/dashboard/db/openstreetmap?orgId=1. And yes, postgres is not yet done" [puppet] - 10https://gerrit.wikimedia.org/r/382905 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [11:11:05] marostegui: did you rebase to head? [11:11:21] I can see you did [11:11:23] thanks [11:11:26] !log Deploy schema change on db1084 (s4) - T174569 [11:11:30] _joe_: Yay :) Are you going to do the cleanup too, or should we poke someone else about it? 
[11:11:30] Yeah, I did :) [11:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:36] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [11:11:39] (03CR) 10Alexandros Kosiaris: "Yeah, agreed" [puppet] - 10https://gerrit.wikimedia.org/r/382906 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [11:11:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 - T174569 (duration: 00m 57s) [11:11:55] <_joe_> eddiegp: I guess someone else for the cleanup, tbh [11:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:14] <_joe_> I am looking into this as apparently I'm one of the few people confident in merging apache changes [11:12:16] (03CR) 10Muehlenhoff: Add a prometheus exporter for ircd (032 comments) [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [11:12:18] marostegui: sorry, it takes so long that I got distracted by other tasks [11:12:24] (03PS2) 10Muehlenhoff: Add a prometheus exporter for ircd [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 [11:12:39] jynus: no problem at all! [11:12:51] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2066 (duration: 00m 57s) [11:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:28] _joe_: Okay, I thought so as it's mostly mw config/db. 
:) [11:15:19] (03PS1) 10Filippo Giunchedi: prometheus: add mtail to varnish jobs [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) [11:18:55] (03PS1) 10Jcrespo: mariadb: Depool db1092 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397775 (https://phabricator.wikimedia.org/T181777) [11:19:28] (03PS2) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [11:19:56] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1092 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397775 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [11:20:08] (03PS2) 10Marostegui: db1086.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/397757 [11:21:02] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1092 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397775 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [11:21:27] (03CR) 10Filippo Giunchedi: Add a prometheus exporter for ircd (031 comment) [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [11:22:30] (03Merged) 10jenkins-bot: mariadb: Depool db1092 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397775 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [11:22:38] (03CR) 10Filippo Giunchedi: Add Prometheus scraper config for ircd exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:22:40] (03CR) 10jenkins-bot: mariadb: Depool db1092 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397775 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [11:24:54] (03PS3) 10Muehlenhoff: Add a prometheus exporter for ircd [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 [11:25:03] (03CR) 10Muehlenhoff: Add a 
prometheus exporter for ircd (031 comment) [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [11:25:05] (03PS2) 10ArielGlenn: snapshots only have one nfs filesystem mounted, remove cruft [puppet] - 10https://gerrit.wikimedia.org/r/397773 [11:26:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 56s) [11:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:45] !log Upgrade MySQL and kernel on db1086 [11:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:11] (03CR) 10Muehlenhoff: Add Prometheus scraper config for ircd exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) (owner: 10Muehlenhoff) [11:27:19] (03PS2) 10Muehlenhoff: Add Prometheus scraper config for ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) [11:27:21] !log stop, upgrade and reboot db1092 [11:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:40] (03CR) 10Marostegui: [C: 032] db1086.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/397757 (owner: 10Marostegui) [11:29:56] 10Operations, 10monitoring, 10Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3830539 (10akosiaris) I 've merged https://gerrit.wikimedia.org/r/351276 today (should have merged it months ago), this kills the module and role finally. 
[11:30:12] (03PS3) 10Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 [11:30:19] (03PS12) 10Giuseppe Lavagetto: apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [11:31:31] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: redirect several wikis per Board of Trustees and LangCom request [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [11:33:20] (03PS1) 10Marostegui: db-eqiad.php: Repool db1086 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397777 [11:33:36] (03CR) 10Marostegui: [C: 04-2] "Wait for the server to catch up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397777 (owner: 10Marostegui) [11:37:27] (03CR) 10Volans: [C: 031] "LGTM as a starting point. We'll improve it once really used." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [11:38:05] <_joe_> eddiegp, fwiw, the change seems ok, letting puppet distribute it everywhere [11:38:13] (03CR) 10Alexandros Kosiaris: [C: 032] "Per https://puppet-compiler.wmflabs.org/compiler02/9290/ this will increase the available uwsgi workers in codfw by 50%, and slightly decr" [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [11:38:20] (03PS8) 10Alexandros Kosiaris: Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [11:38:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Refactor web workers for ORES [puppet] - 10https://gerrit.wikimedia.org/r/396064 (https://phabricator.wikimedia.org/T182249) (owner: 10Halfak) [11:49:15] (03PS3) 10ArielGlenn: snapshots only have one nfs filesystem mounted, remove cruft [puppet] - 10https://gerrit.wikimedia.org/r/397773 [11:50:36] (03CR) 10ArielGlenn: [C: 032] snapshots only have one nfs filesystem mounted, remove cruft [puppet] - 10https://gerrit.wikimedia.org/r/397773 (owner: 10ArielGlenn) [11:53:30] (03PS3) 10Volans: puppet: change location of environment setting from [main] to [agent] [puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [11:53:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1086 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397777 (owner: 10Marostegui) [11:53:37] _joe_: FYI I'm merging it ^^^ [11:54:03] _joe_: Thanks. I'm currently on mobile, but I gonna test it later today. 
[11:54:39] (03CR) 10Volans: [C: 032] puppet: change location of environment setting from [main] to [agent] [puppet] - 10https://gerrit.wikimedia.org/r/397624 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [11:55:04] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1086 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397777 (owner: 10Marostegui) [11:55:12] <_joe_> volans: thanks [11:55:25] * volans cross fingers [11:55:44] <_joe_> eddiegp: not sure the cache will be purged by then [11:56:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 with low weight - T163190 (duration: 00m 55s) [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190 [11:56:46] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1086 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397777 (owner: 10Marostegui) [11:58:18] 10Operations, 10Dumps-Generation, 10Patch-For-Review: fix up datasets uid - https://phabricator.wikimedia.org/T113467#3830577 (10ArielGlenn) 05Open>03Resolved removed on all hosts, the dumpsgen user which replaces it, is set up properly. 
[11:58:51] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1092 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397783 [11:58:59] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1092 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397783 [11:59:42] (03PS5) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [12:00:30] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [12:01:22] hashar: ^^ looks like a jenkins issue [12:01:34] (or multiple ones) [12:02:05] <_joe_> lemme see [12:02:59] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1092 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397783 (owner: 10Jcrespo) [12:03:49] it's complaining about wrong permissions [12:04:35] * volans wonders if the change recently merged could be related [12:04:36] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1092 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397783 (owner: 10Jcrespo) [12:04:49] doesn't seem so at first sight, but with puppet you never know ;) [12:05:13] (03PS5) 10Hashar: contint: remove browsertests role from permanent slaves [puppet] - 10https://gerrit.wikimedia.org/r/397601 (https://phabricator.wikimedia.org/T182642) (owner: 10Paladox) [12:06:03] (03PS6) 10Alexandros Kosiaris: Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) [12:06:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add 3 prometheus checks for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/397546 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [12:06:07] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 
(https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [12:06:09] (03PS6) 10Alexandros Kosiaris: Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) [12:06:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add kubelet operational latencies check [puppet] - 10https://gerrit.wikimedia.org/r/397552 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [12:06:21] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 (duration: 00m 55s) [12:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:47] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1092 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397783 (owner: 10Jcrespo) [12:08:25] (03CR) 10Hashar: "Cherry picked on the CI puppet master. We will see what happens once puppet has ran everywhere. At least integration-slave-jessie-1001 is " [puppet] - 10https://gerrit.wikimedia.org/r/397601 (https://phabricator.wikimedia.org/T182642) (owner: 10Paladox) [12:10:28] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [12:10:32] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:12] _joe_: varnish or some other cache? Should be testable with X-Wikimedia-Debug, no? [12:12:58] I'm confidet that it'll work/break on all apaches alike, so if it works on one I'd consider it working. 
[12:13:23] (03PS17) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [12:13:25] (03PS1) 10Jcrespo: mariadb: Move db2071 socket to the standard location [puppet] - 10https://gerrit.wikimedia.org/r/397784 (https://phabricator.wikimedia.org/T148507) [12:13:28] (Probably a bad assumption to make, I know) [12:13:38] godog: I'm testing one thing for the dashbord links, ignore the next Icinga complain on "carbon-frontend-relay metric drops" [12:13:58] <_joe_> eddiegp: varnish, with X-wikimedia-debug it should already work [12:14:18] <_joe_> or well, it did in my tests :) [12:16:15] (03PS2) 10Jcrespo: mariadb: Move db2071 socket to the standard location [puppet] - 10https://gerrit.wikimedia.org/r/397784 (https://phabricator.wikimedia.org/T148507) [12:16:17] (03PS1) 10Muehlenhoff: Record new MOU date for nithum [puppet] - 10https://gerrit.wikimedia.org/r/397786 [12:17:30] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3830640 (10elukey) a:03elukey [12:17:56] (03PS1) 10Hashar: contint: disable XDebug by default [puppet] - 10https://gerrit.wikimedia.org/r/397787 (https://phabricator.wikimedia.org/T175028) [12:18:05] (03CR) 10Muehlenhoff: [C: 032] Record new MOU date for nithum [puppet] - 10https://gerrit.wikimedia.org/r/397786 (owner: 10Muehlenhoff) [12:18:11] (03CR) 10Hashar: [C: 04-1] "gotta test it" [puppet] - 10https://gerrit.wikimedia.org/r/397787 (https://phabricator.wikimedia.org/T175028) (owner: 10Hashar) [12:18:29] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: IGNORE, this is a test - volans %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27 [12:18:31] ^^^^ that's me, ignore it please 
[12:19:04] (03CR) 10MarcoAurelio: "There's a problem with the RO redirects. Probably an encoding issue since they redirect you to "Pagina principalÄ3". Can somebody please f" [puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [12:19:42] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3830645 (10elukey) notebook1002 is now PXE installing fine, I removed the previous hw config and created 12 1 disk RAID0 virtual devices with PERC. The oth... [12:19:57] (03PS1) 10Jcrespo: mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) [12:20:39] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27 [12:21:11] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [12:21:15] (03CR) 10Giuseppe Lavagetto: "I'll look into that." 
[puppet] - 10https://gerrit.wikimedia.org/r/393289 (https://phabricator.wikimedia.org/T169450) (owner: 10MarcoAurelio) [12:22:37] (03PS2) 10Jcrespo: mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) [12:25:13] (03PS1) 10ArielGlenn: fix up for script that lists last good dumps for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/397790 [12:26:03] (03PS2) 10ArielGlenn: fix up for script that lists last good dumps for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/397790 [12:26:46] (03CR) 10ArielGlenn: [C: 032] fix up for script that lists last good dumps for mirrors [puppet] - 10https://gerrit.wikimedia.org/r/397790 (owner: 10ArielGlenn) [12:28:43] (03PS3) 10Muehlenhoff: mediawiki: Remove Score firejail wrappers [puppet] - 10https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: 10Legoktm) [12:28:54] _joe_: Duh, I don't know where that comes from. If I access the URI given in the apache config directly it's working fine. [12:29:26] <_joe_> eddiegp: I have the same issue btw when trying from a browser like firefox [12:29:38] <_joe_> I think I have the solution though [12:29:46] <_joe_> see how tricky apache changes are? 
:P [12:29:57] I did never say otherwise :P [12:29:59] <_joe_> if gerrit behaves, my patch should be incoming shortly [12:30:32] <_joe_> it's not behaving :P [12:30:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: use utf-8 encoded redirects for mowiki [puppet] - 10https://gerrit.wikimedia.org/r/397793 (https://phabricator.wikimedia.org/T169450) [12:30:57] <_joe_> eddiegp: ^^ [12:31:27] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use utf-8 encoded redirects for mowiki [puppet] - 10https://gerrit.wikimedia.org/r/397793 (https://phabricator.wikimedia.org/T169450) (owner: 10Giuseppe Lavagetto) [12:31:57] So it basically just needed escaping of the % [12:32:04] (03CR) 10Muehlenhoff: [C: 032] mediawiki: Remove Score firejail wrappers [puppet] - 10https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: 10Legoktm) [12:32:14] <_joe_> eddiegp: yes [12:32:14] (03PS4) 10Muehlenhoff: mediawiki: Remove Score firejail wrappers [puppet] - 10https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: 10Legoktm) [12:33:11] _joe_: Well, and I gave MarcoAurelio the percent-encoded link because I thought it'd be less likely to fail :D [12:33:23] (03PS3) 10Urbanecm: Restrict merging rights to autoconfirmed users on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) [12:33:34] <_joe_> eddiegp: :P [12:33:37] (03PS2) 10Urbanecm: Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) [12:33:49] _joe_: shall I puppet-merge the mowiki patch along? 
[12:33:56] <_joe_> moritzm: no please hold [12:33:59] k [12:34:06] <_joe_> I'm waiting for disabling puppet to work everywhere [12:34:27] sure, you can simply merge mine along when you're ready, then [12:34:38] (03PS1) 10Alexandros Kosiaris: Escape the exclamation mark in icinga k8s master checks [puppet] - 10https://gerrit.wikimedia.org/r/397794 (https://phabricator.wikimedia.org/T177395) [12:35:15] (03CR) 10jerkins-bot: [V: 04-1] Escape the exclamation mark in icinga k8s master checks [puppet] - 10https://gerrit.wikimedia.org/r/397794 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [12:35:29] <_joe_> moritzm: merging [12:36:04] <_joe_> moritzm: but your change won't be applied until I've done my tests [12:36:20] (03PS1) 10Urbanecm: Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 [12:36:31] yeah, np [12:36:38] it's just a cleanup [12:38:30] <_joe_> eddiegp: ok it should work. [12:40:14] <_joe_> confirmed working on mwdebug1001 [12:40:20] Hi, anybody who can have a look at T182407 ? [12:40:21] T182407: Strip 2FA for 'Martin Urbanec' account at arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T182407 [12:40:23] (03PS1) 10Volans: Icinga: use the env variable instead of the macro [puppet] - 10https://gerrit.wikimedia.org/r/397796 (https://phabricator.wikimedia.org/T170353) [12:42:19] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3830715 (10Joe) With my latest patch, all redirects (including the mowiki and mowiktionary ones) work as expected. It will take some time fo... 
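(Editor's aside on the redirect fix discussed above, where the RO redirects landed on "Pagina principalÄ3" until the `%` was escaped.) The following is a minimal toy model, not the actual puppet or Apache code: it assumes, as the chat suggests, that an unescaped `%` followed by a digit in a rewrite target is treated as a backreference and expands to nothing when no such group exists. The helper names `expand_backrefs` and `escape_percent` are hypothetical and exist only for this illustration.

```python
import re
from urllib.parse import unquote


def expand_backrefs(substitution, groups=None):
    """Toy model of a rewrite target: unescaped %N (N = 0-9) becomes
    the matching group (empty if absent); an escaped \\% becomes a literal %."""
    groups = groups or {}

    def repl(match):
        if match.group(1):  # escaped percent sign: keep it literally
            return "%"
        return groups.get(int(match.group(2)), "")

    return re.sub(r"(\\%)|%(\d)", repl, substitution)


def escape_percent(target):
    """Hypothetical helper: backslash-escape every % in a percent-encoded target."""
    return target.replace("%", "\\%")


target = "Pagina_principal%C4%83"  # UTF-8 percent-encoding of "Pagina_principală"

# Unescaped: "%8" is swallowed as an empty backreference, leaving "%C43".
mangled = expand_backrefs(target)
# Escaped: both percent signs survive and the target is unchanged.
fixed = expand_backrefs(escape_percent(target))

print(unquote(mangled, encoding="latin-1"))
print(unquote(fixed))
```

Under these assumptions the unescaped target decodes to the same "Ä3" ending reported in the Gerrit comment, while the escaped one round-trips to the intended page title.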
[12:43:36] _joe_: wfm too now [12:43:44] (03PS2) 10Alexandros Kosiaris: Escape the exclamation mark in icinga k8s master checks [puppet] - 10https://gerrit.wikimedia.org/r/397794 (https://phabricator.wikimedia.org/T177395) [12:44:00] (also the x-wikimedia-debug plugin works fine in firefox on mobile) [12:44:49] <_joe_> heh I have to see if it works on FF 57 though [12:44:52] (03CR) 10Alexandros Kosiaris: [C: 032] Escape the exclamation mark in icinga k8s master checks [puppet] - 10https://gerrit.wikimedia.org/r/397794 (https://phabricator.wikimedia.org/T177395) (owner: 10Alexandros Kosiaris) [12:44:57] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:10] (03PS9) 10Rush: WIP: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [12:45:15] It does. [12:45:52] (03CR) 10jerkins-bot: [V: 04-1] WIP: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [12:45:58] I'm out now, I've got a job interview in ten minutes. I'll have another look on the redirects and will bug someone to do the cleanup when I'm home. [12:48:34] <_joe_> eddiegp: break a leg! :) [12:48:37] <_joe_> and thanks for the help [12:49:17] (03CR) 10Elukey: "Adding Ema, BBlack and Ottomata to discuss the details of where to place certs, etc.." 
[puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [12:51:10] (03CR) 10Alexandros Kosiaris: "lol, ok lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/397796 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:51:13] (03CR) 10Alexandros Kosiaris: [C: 031] Icinga: use the env variable instead of the macro [puppet] - 10https://gerrit.wikimedia.org/r/397796 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:51:28] let's try it :D [12:51:52] (03PS2) 10Volans: Icinga: use the env variable instead of the macro [puppet] - 10https://gerrit.wikimedia.org/r/397796 (https://phabricator.wikimedia.org/T170353) [12:51:58] OK - apiserver_request_latencies is 4762 [12:52:04] 42 [12:52:10] <_joe_> akosiaris: in milliseconds? [12:52:15] yeah [12:52:18] in hours :D [12:52:33] <_joe_> akosiaris: eww. can't we run it on drupal instead? [12:52:41] (03CR) 10Volans: [C: 032] Icinga: use the env variable instead of the macro [puppet] - 10https://gerrit.wikimedia.org/r/397796 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [12:52:46] yeah let's create a module for it [12:53:00] <_joe_> or I dunno, rewrite kubernetes in nodejs, I hear it's nonblocking [12:53:21] <_joe_> jokes aside, it's quite slow :P [12:54:04] <_joe_> not sure that's expected, what metrics are you collecting exactly? [12:54:24] https://grafana-admin.wikimedia.org/dashboard/db/kubernetes-api?panelId=10&fullscreen&orgId=1 [12:54:27] <_joe_> don't tell me to read that check's parameters :P [12:54:28] it's a sum btw [12:54:37] <_joe_> oh ok [12:54:39] <_joe_> not a rate [12:55:03] it's a sum of all request latencies for the given master [12:55:11] and it's a rate too [12:55:23] sum(rate(apiserver_request_latencies_summary_sum{job="k8s-api",verb!="WATCH"}[5m])) [12:55:51] that's part of the query.. 
the actual one is a tad more complicated [12:55:51] <_joe_> ok so a sum of 5 minutes of the 5-min rate [12:55:53] <_joe_> uhm [12:56:51] <_joe_> also that number (4762) is like 5 seconds, not much to do with what I read in the graph [12:57:13] <_joe_> or was it 42 and you mistyped? [12:57:24] <_joe_> damn I just saw it. [12:57:28] no, 42 was me making fun [12:58:31] the graph is not per master btw [12:58:45] honestly... the check and the graph are not really related in any way [12:58:45] :P [12:58:46] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: IGNORE, this is a test - volans [12:58:51] ^^^ this is me, again, ignore ^^^^ [12:59:01] <_joe_> volans: we can read [12:59:06] :D [12:59:31] <_joe_> akosiaris: I see [12:59:42] (03PS10) 10Rush: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [13:00:07] akosiaris: and of course we've disabled the env variables [13:00:11] enable_environment_macros=0 [13:00:24] (03CR) 10jerkins-bot: [V: 04-1] cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [13:00:35] cit. Enabling this option can cause performance issues in large installations, as it will consume a bit more memory and (more importantly) consume more CPU. 
:( [13:00:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:00:46] PROBLEM - nutcracker port on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:00:57] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3830767 (10akosiaris) [13:01:00] 10Operations, 10Prod-Kubernetes, 10monitoring, 10Kubernetes, and 3 others: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395#3830765 (10akosiaris) 05Open>03Resolved And with the above merged, this is resolved. [13:01:10] (03PS11) 10Rush: cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) [13:01:22] mw1260 is reimage noise, silencing [13:01:44] (03CR) 10jerkins-bot: [V: 04-1] cloud: setup for attended upgrade process [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [13:02:56] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] [13:06:51] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/394200 (https://phabricator.wikimedia.org/T181647) (owner: 10Rush) [13:14:53] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [13:15:30] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3830791 (10Joe) @MarcoAurelio in my tests the redirects all work now, can you confirm they're ok? If so, should we close this ticket or wait... 
[13:16:54] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397805 [13:20:56] (03PS1) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [13:21:35] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3830792 (10Strainu) I can confirm the redirects work as intended. [13:21:39] (03CR) 10jerkins-bot: [V: 04-1] clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 (owner: 10ArielGlenn) [13:22:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397805 (owner: 10Marostegui) [13:23:43] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397805 (owner: 10Marostegui) [13:25:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1086 (duration: 00m 57s) [13:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397805 (owner: 10Marostegui) [13:28:55] (03PS2) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [13:31:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw1260 is OK: OK ferm input default policy is set [13:35:28] (03PS2) 10Hashar: contint: disable XDebug by default [puppet] - 10https://gerrit.wikimedia.org/r/397787 (https://phabricator.wikimedia.org/T175028) [13:40:35] (03CR) 10Faidon Liambotis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) 
[13:40:48] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [13:40:52] hashar: ^ ? [13:41:17] huh [13:41:24] nevermind, different issue now [13:41:58] paravoid: yup that one is annoying. That is due to CI attempting to merge the proposed patch against the tip of the branch [13:42:04] so I guess "production" has evolved in between :/ [13:42:10] yeah that's a different issue than the one I was originally experiencing [13:42:15] and I automatically assumed it was the same [13:42:18] or maybe git get confused which sometime might happen [13:42:32] (03PS6) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [13:42:40] _joe_ merged a redirects patch in the meantime [13:42:42] let's see now [13:43:16] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [13:43:45] hashar: ok, see now :) [13:44:00] https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/11445/console specifically [13:45:05] (03PS3) 10Faidon Liambotis: wmflib/to_milliseconds: fix two minor RuboCop cops [puppet] - 10https://gerrit.wikimedia.org/r/359450 [13:45:07] (03PS3) 10Faidon Liambotis: wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 [13:45:09] (03PS3) 10Faidon Liambotis: graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 [13:45:12] (03PS3) 10Faidon Liambotis: Fix Style/FormatString Rucobop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 [13:45:14] (03PS3) 10Faidon Liambotis: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 [13:45:16] (03PS2) 10Faidon Liambotis: Fix more whitespace-related Rubocop across the tree [puppet] - 
10https://gerrit.wikimedia.org/r/359478 [13:45:18] (03PS2) 10Faidon Liambotis: Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 [13:45:29] (03PS1) 10Faidon Liambotis: rubocop: remove exceptions for non-existent files [puppet] - 10https://gerrit.wikimedia.org/r/397810 [13:45:31] (03PS1) 10Faidon Liambotis: Fix Style/NumericLiterals RuboCop offense [puppet] - 10https://gerrit.wikimedia.org/r/397811 [13:45:33] (03PS1) 10Faidon Liambotis: puppet_statsd: fix three RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/397812 [13:45:37] (03PS2) 10Faidon Liambotis: wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 [13:45:39] (03PS2) 10Faidon Liambotis: wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 [13:45:41] (03PS2) 10Faidon Liambotis: base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 [13:45:43] (03PS2) 10Faidon Liambotis: utils/expanderrb.rb: fix Style/SpecialGlobalVars [puppet] - 10https://gerrit.wikimedia.org/r/359483 [13:45:45] (03PS2) 10Faidon Liambotis: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 [13:45:47] (03PS2) 10Faidon Liambotis: rubocop: move three ignores to .rubocop.yml [puppet] - 10https://gerrit.wikimedia.org/r/359485 [13:45:49] ;] [13:46:56] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397814 [13:47:57] hashar: any ideas (re: the tox issues above) [13:48:15] (03CR) 10jerkins-bot: [V: 04-1] Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 (owner: 10Faidon Liambotis) [13:48:27] (03CR) 10Filippo Giunchedi: "This was attempted before in https://gerrit.wikimedia.org/r/#/c/364687/ but ultimately rejected (see also rationale there)" [puppet] - 10https://gerrit.wikimedia.org/r/377414 (owner: 
10Hashar) [13:48:56] paravoid: hmm that sounds like a CI issue indeed [13:49:17] gotta verify [13:49:45] (03CR) 10Faidon Liambotis: [C: 032] rubocop: remove exceptions for non-existent files [puppet] - 10https://gerrit.wikimedia.org/r/397810 (owner: 10Faidon Liambotis) [13:49:54] (03CR) 10jerkins-bot: [V: 04-1] wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 (owner: 10Faidon Liambotis) [13:50:01] jouncebot, next [13:50:01] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T1400) [13:50:22] (03CR) 10Faidon Liambotis: [C: 032] Fix Style/NumericLiterals RuboCop offense [puppet] - 10https://gerrit.wikimedia.org/r/397811 (owner: 10Faidon Liambotis) [13:50:24] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [13:50:34] (03CR) 10jerkins-bot: [V: 04-1] wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 (owner: 10Faidon Liambotis) [13:50:52] (03CR) 10Faidon Liambotis: [C: 032] puppet_statsd: fix three RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/397812 (owner: 10Faidon Liambotis) [13:50:59] (03CR) 10jerkins-bot: [V: 04-1] base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 (owner: 10Faidon Liambotis) [13:51:21] oh god did you have to look at ssl_ciphersuite for some of those? it's horrible :P [13:51:34] I did :) [13:51:58] just one issue I think [13:52:05] if ! 
-> unless [13:53:14] (03CR) 10jerkins-bot: [V: 04-1] Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 (owner: 10Faidon Liambotis) [13:54:10] (03Merged) 10jenkins-bot: mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [13:54:51] (03PS1) 10Hashar: (DO NOT MERGE) touch tox.ini > fail because /cache/pip is RO [puppet] - 10https://gerrit.wikimedia.org/r/397817 [13:54:53] paravoid: the docker image is broken [13:55:14] the tox dependencies are provisioned as root and that writes to /cache/ [13:55:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] [13:55:18] those files belong to root [13:55:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397814 (owner: 10Marostegui) [13:55:38] whenever tox.ini is touched, tox reinstalls the precached environment and fails when trying to write to /cache, bah [13:55:39] (03CR) 10jerkins-bot: [V: 04-1] (DO NOT MERGE) touch tox.ini > fail because /cache/pip is RO [puppet] - 10https://gerrit.wikimedia.org/r/397817 (owner: 10Hashar) [13:56:40] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2071 (duration: 00m 57s) [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:00] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397814 (owner: 10Marostegui) [13:57:07] RECOVERY - nutcracker port on mw1260 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:57:15] (03CR) 10Zfilipin: [C: 031] Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [13:57:17] (03PS4) 10Faidon Liambotis: Fix
Style/FormatString RuboCop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 [13:57:19] (03PS4) 10Faidon Liambotis: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 [13:57:21] (03PS3) 10Faidon Liambotis: Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 [13:57:23] (03PS3) 10Faidon Liambotis: Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 [13:58:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1086 (duration: 00m 56s) [13:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:56] (03PS2) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [13:59:07] (03PS2) 10Zfilipin: Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [13:59:23] !log stop, upgrade and reboot db2071 [13:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:02] (03CR) 10Zfilipin: [C: 031] Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T1400). [14:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:12] Present [14:00:15] I can SWAT today [14:00:24] That's great! [14:00:36] hashar: so, what needs to be done and by whom? :) [14:00:46] paravoid: luckily I wrote a patch for it yesterday: https://gerrit.wikimedia.org/r/#/c/397577/ I am rebuilding the image [14:00:49] Urbanecm: ok, I'll deploy 397795 since there is nothing to test there, right? [14:00:52] and hopefully will figure out how to publish it [14:00:57] (03CR) 10jenkins-bot: mariadb: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397788 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [14:01:00] If that's the throttle rule.... [14:01:06] Urbanecm: yes [14:01:14] Ok, then go ahead please! [14:01:18] (03PS3) 10Andrew Bogott: WMCS: use puppet 4 for any VM-hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/397711 [14:01:41] Urbanecm: I'll ping you in a few minutes when the other patch is at mwdebug1002, so you can test there [14:01:48] Ok, thanks [14:01:54] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:02:06] !log upgrading codfw puppet agents [14:02:08] (03PS3) 10Zoranzoki21: Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:32] (03CR) 10Zoranzoki21: "First for 13th. Than for 14th."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:02:56] (03CR) 10Zoranzoki21: "@Zfilipin Please again add +2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:03:29] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2071 socket to the standard location [puppet] - 10https://gerrit.wikimedia.org/r/397784 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:03:37] (03PS3) 10Jcrespo: mariadb: Move db2071 socket to the standard location [puppet] - 10https://gerrit.wikimedia.org/r/397784 (https://phabricator.wikimedia.org/T148507) [14:04:08] Zfilipin: Please again add +2 on patch https://gerrit.wikimedia.org/r/#/c/397795/3 [14:04:23] Zfilipin: Please again add +2 on patch https://gerrit.wikimedia.org/r/#/c/397795 [14:04:35] I made edit in same time when you added +2 [14:04:37] Sorry [14:05:16] (03PS4) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [14:05:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [14:05:34] Zoranzoki21: that's a last second change if there ever was one :D [14:05:42] (03PS5) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [14:05:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:06:05] zeljkof: Ok. 
Sorry again [14:06:14] Zoranzoki21: no problem [14:06:23] zeljkof: Thank you [14:07:07] (03Merged) 10jenkins-bot: Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [14:08:52] uh oh, did scap become very verbose lately? cc hashar, thcipriani|afk? [14:09:03] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:397795|Lift account registration on en.wiki (T182665)]] (duration: 00m 56s) [14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:14] T182665: Lift account registration on en.wikipedia for 14th December 2017 - https://phabricator.wikimedia.org/T182665 [14:09:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [14:10:00] (03PS3) 10Zoranzoki21: Remove mysql module from WMF [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [14:11:12] (03Merged) 10jenkins-bot: Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [14:13:03] Urbanecm: please stand by, running `scap pull` at mwdebug1002, but it's taking more time than usual :| [14:13:29] I can be available during the whole SWAT, so...nothing happens :D [14:15:10] Urbanecm: 393835 is at mwdebug1002, please test and let me know if I can deploy [14:15:31] scap is strange today, let me know if it looks like it's not deployed [14:15:37] I can re-run it [14:16:07] ok, I did re-run it at mwdebug, worked fast this time, not sure what happened the first time [14:16:09] (03PS3) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [14:16:52] It is working, please deploy to the whole universe [14:16:59] 
Urbanecm: deploying [14:18:12] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:393835|Add NS aliases for zh_wikiquote (T181374)]] (duration: 00m 56s) [14:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:21] T181374: Request to change namespaces of zh wikiquote - https://phabricator.wikimedia.org/T181374 [14:18:49] Urbanecm: deployed, please check and thanks for deploying with #releng! :) [14:18:57] anything else for SWAT? [14:20:09] !log EU SWAT finished [14:20:11] Not for SWAT itself, but I would like to see T182407 resolved... [14:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] T182407: Strip 2FA for 'Martin Urbanec' account at arbcom-cs.wikipedia.org - https://phabricator.wikimedia.org/T182407 [14:20:34] If I found any I will put here [14:20:55] Urbanecm: sorry, can not help you there [14:21:08] zeljkof, ok, I will wait for somebody else :) [14:21:44] zeljkof: Yeah, scap is too verbose in the latest release. See T182643. [14:21:44] T182643: cache_git_info (from e.g. scap sync-file) is way way too verbose - https://phabricator.wikimedia.org/T182643 [14:22:07] Niharika: thanks! 
[14:22:40] (03PS1) 10Marostegui: db-eqiad.php: Restore db1086 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397819 [14:25:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1086 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397819 (owner: 10Marostegui) [14:27:08] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1086 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397819 (owner: 10Marostegui) [14:28:06] (03PS6) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [14:28:08] (03PS1) 10Andrew Bogott: labs-bootstrapvz-jessie: include puppet pinning in base images [puppet] - 10https://gerrit.wikimedia.org/r/397821 (https://phabricator.wikimedia.org/T178717) [14:28:10] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397822 [14:28:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1086 (duration: 00m 56s) [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:17] (03PS2) 10Andrew Bogott: labs-bootstrapvz-jessie: include puppet pinning in base images [puppet] - 10https://gerrit.wikimedia.org/r/397821 (https://phabricator.wikimedia.org/T178717) [14:31:22] (03CR) 10Andrew Bogott: [C: 032] labs-bootstrapvz-jessie: include puppet pinning in base images [puppet] - 10https://gerrit.wikimedia.org/r/397821 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [14:31:43] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397822 (owner: 10Jcrespo) [14:32:59] herron, andrewbogott: we should probably import puppet to jessie-wikimedia and remove the pins [14:33:18] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397822 (owner: 10Jcrespo) [14:33:24] paravoid sounds good,
how do you do that? [14:33:46] we've done the backports thing before and it didn't go very well over time (backports was updated to an even newer version, I think 4 at the time) [14:34:14] importing should help with new builds as well [14:34:22] (03CR) 10Ottomata: role::cache::canary: add a test Varnishkafka instance (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:35:16] jessie-backports is /probably/ not going to change soon, but still, probably a bad idea to rely on a distribution that isn't stable for such a critical piece of infrastructure [14:35:20] so how you do that is [14:35:51] download the dsc and debs from Debian in install1002 (e.g. "apt-get source puppet" and "apt-get download puppet{,-common}") [14:36:08] then sudo -i, and "reprepro include" [14:36:17] I think we have more detailed instructions on wikitech, let me check [14:36:29] yeah: https://wikitech.wikimedia.org/wiki/Reprepro [14:37:34] (03CR) 10Elukey: role::cache::canary: add a test Varnishkafka instance (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:37:59] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2071 (duration: 00m 56s) [14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:23] paravoid thanks! 
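The import steps described above (fetch the source package and the debs from Debian, then add them to the local repository with reprepro) can be sketched as a small dry-run planner. This is a hedged illustration, not the documented procedure: `import_plan` and the `VERSION` file names are hypothetical, and it assumes reprepro's `includedsc` and `includedeb` commands rather than the bare "reprepro include" mentioned in the chat, since `include` expects a `.changes` file. The Reprepro page on wikitech remains the authoritative reference.

```python
import shlex


def import_plan(source, binaries, codename, dsc_file, deb_files):
    """Return the commands (as argv lists) for importing a package.

    Nothing is executed here; the plan is only printed for review.
    """
    plan = [
        # Fetch the source package (.dsc plus tarballs) without unpacking it.
        ["apt-get", "source", "--download-only", source],
        # Fetch the binary packages into the current directory.
        ["apt-get", "download", *binaries],
        # Add the source package to the target distribution.
        ["reprepro", "includedsc", codename, dsc_file],
    ]
    # Add each binary package to the same distribution.
    plan += [["reprepro", "includedeb", codename, deb] for deb in deb_files]
    return plan


# File names are placeholders; the real ones depend on the package version.
plan = import_plan(
    source="puppet",
    binaries=["puppet", "puppet-common"],
    codename="jessie-wikimedia",
    dsc_file="puppet_VERSION.dsc",
    deb_files=["puppet_VERSION_all.deb", "puppet-common_VERSION_all.deb"],
)

for argv in plan:
    print(shlex.join(argv))
```

Printing the plan instead of running it keeps the sketch safe to try anywhere; on the real host the reprepro steps would be run with the appropriate privileges, as the chat notes.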
[14:39:00] herron: if we do that it'll save me a few steps :) [14:39:08] jessie-wikimedia takes precedence over jessie, so the pins won't be necessary after that [14:39:11] (but should be cleaned up) [14:39:25] (03PS1) 10Jcrespo: mariadb: Depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397828 (https://phabricator.wikimedia.org/T181777) [14:39:29] for sure [14:39:30] (03CR) 10Ottomata: role::cache::canary: add a test Varnishkafka instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:39:36] !log Rebuild operations-puppet-tests-docker image based on c76d8920901fd0be0f9ced3bc900cb72f2d1d4a2 | T178620 and /cache being owned by root [14:39:45] (03CR) 10Ema: [C: 032] mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:46] T178620: operations-puppet Docker container takes a while to build - https://phabricator.wikimedia.org/T178620 [14:39:50] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/397817 (owner: 10Hashar) [14:39:55] (03PS5) 10Ema: mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) [14:39:59] (03CR) 10Ema: [V: 032 C: 032] mtail: port varnishxcps [puppet] - 10https://gerrit.wikimedia.org/r/395578 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:41:00] (03CR) 10Ottomata: role::cache::canary: add a test Varnishkafka instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:41:23] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [14:42:18] (03CR) 10Elukey: role::cache::canary: add a test Varnishkafka instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:42:22] 10Operations, 10Puppet, 
10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3831090 (10EddieGP) [14:42:26] paravoid: the tox run is fixed (I have rebuilt the docker container based on my patch from yesterday) [14:42:29] (03Abandoned) 10Hashar: (DO NOT MERGE) touch tox.ini > fail because /cache/pip is RO [puppet] - 10https://gerrit.wikimedia.org/r/397817 (owner: 10Hashar) [14:42:39] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3413551 (10EddieGP) [14:42:48] paravoid: so your patch touching tox.ini is now passing https://gerrit.wikimedia.org/r/#/c/357733/ [14:42:51] awesome [14:42:53] thanks :) [14:43:11] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397828 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [14:43:26] (03PS2) 10Elukey: Prepare the conditions to rename notebook1002 in kafka1023 [dns] - 10https://gerrit.wikimedia.org/r/397539 (https://phabricator.wikimedia.org/T181518) [14:43:51] paravoid: you have been unlucky enough to be the first person to touch tox.ini in 9 weeks :] [14:44:17] (03CR) 10Faidon Liambotis: [C: 032] wmflib/to_milliseconds: fix two minor RuboCop cops [puppet] - 10https://gerrit.wikimedia.org/r/359450 (owner: 10Faidon Liambotis) [14:44:24] hashar: lucky me! [14:44:25] (03PS4) 10Faidon Liambotis: wmflib/to_milliseconds: fix two minor RuboCop cops [puppet] - 10https://gerrit.wikimedia.org/r/359450 [14:44:42] (03CR) 10Steinsplitter: [C: 031] "Looks good to me, but needs rebase."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [14:45:16] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/397539 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [14:45:45] (03Merged) 10jenkins-bot: mariadb: Depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397828 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [14:45:57] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, and 2 others: Redirect several wikis - https://phabricator.wikimedia.org/T169450#3831128 (10EddieGP) 05Open>03Resolved a:03EddieGP The redirect part is done and working as intended (thanks to @Joe !). What's left to... [14:46:47] !log start rename notebook1002 -> kafka1023 - step 2, dns config (host already shutdown) - T181518 [14:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:58] T181518: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518 [14:47:01] (03PS1) 10Ema: cache: install varnishxcps.mtail [puppet] - 10https://gerrit.wikimedia.org/r/397831 (https://phabricator.wikimedia.org/T177199) [14:47:13] (03CR) 10Elukey: [C: 032] Prepare the conditions to rename notebook1002 in kafka1023 [dns] - 10https://gerrit.wikimedia.org/r/397539 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [14:48:23] (03CR) 10Steinsplitter: "This is blocked by T90004?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [14:49:42] (03CR) 10Filippo Giunchedi: [C: 031] cache: install varnishxcps.mtail [puppet] - 10https://gerrit.wikimedia.org/r/397831 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:49:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3831142 (10Cmjohnson) @Marostegui Replaced the PSU and both are now redundant Date/Time: 12/12/2017 14:43:15 Source: system Severity: Critical Description: Power supply... [14:51:04] (03PS2) 10Elukey: Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) [14:51:07] (03CR) 10Ema: [C: 032] cache: install varnishxcps.mtail [puppet] - 10https://gerrit.wikimedia.org/r/397831 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:51:08] 10Operations, 10ops-eqiad, 10Analytics: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3831144 (10Cmjohnson) p:05Normal>03Low [14:51:22] RECOVERY - IPMI Sensor Status on db1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [14:51:56] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3831148 (10Marostegui) 05Open>03Resolved That was fast! 
Thanks a lot ``` RECOVERY - IPMI Sensor Status on db1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK ``` [14:52:04] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3822380 (10Cmjohnson) a:03Papaul assigning to @papaul and correct data center [14:55:27] (03CR) 10Ottomata: role::cache::canary: add a test Varnishkafka instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [14:58:02] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1260.eqiad.wmnet [14:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:07] (03PS1) 10Rush: openstack: first control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [14:59:29] (03CR) 10jerkins-bot: [V: 04-1] openstack: first control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:59:34] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-cp100[1-4] - https://phabricator.wikimedia.org/T182034#3831168 (10Cmjohnson) p:05Triage>03Low [14:59:54] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-web100[1-4] - https://phabricator.wikimedia.org/T182033#3831169 (10Cmjohnson) p:05Triage>03Low [15:00:05] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission server zinc - https://phabricator.wikimedia.org/T182016#3831170 (10Cmjohnson) p:05Triage>03Low [15:00:15] (03PS1) 10Gehel: maps: extract style used by kartotherian / tilerator as parameter [puppet] - 10https://gerrit.wikimedia.org/r/397836 (https://phabricator.wikimedia.org/T162241) [15:00:29] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission Vanadium - https://phabricator.wikimedia.org/T182015#3831171 (10Cmjohnson) p:05Triage>03Low [15:00:42] 10Operations, 10ops-eqiad, 10DC-Ops: decommission rcs1001/1002 - https://phabricator.wikimedia.org/T181825#3831173 
(10Cmjohnson) p:05Triage>03Low [15:00:48] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3831175 (10Cmjohnson) p:05Normal>03Low [15:01:02] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3831180 (10Cmjohnson) p:05Triage>03Low a:03Cmjohnson [15:01:14] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission niobium - https://phabricator.wikimedia.org/T181763#3831182 (10Cmjohnson) p:05Triage>03Low [15:01:17] (03PS4) 10Faidon Liambotis: wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 [15:01:19] (03PS4) 10Faidon Liambotis: graphite: cleanup configparser_format a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359451 [15:01:21] (03PS5) 10Faidon Liambotis: Fix Style/FormatString RuboCop across all Rakefiles [puppet] - 10https://gerrit.wikimedia.org/r/359452 [15:01:23] (03PS5) 10Faidon Liambotis: wmflib: fix RuboCop infractions in serializers [puppet] - 10https://gerrit.wikimedia.org/r/359453 [15:01:25] (03PS4) 10Faidon Liambotis: Fix more whitespace-related RuboCop across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359478 [15:01:27] (03PS4) 10Faidon Liambotis: Fix Style/RegexpLiteral RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359479 [15:01:29] (03PS3) 10Faidon Liambotis: wmflib: fix another couple minor RuboCop offenses [puppet] - 10https://gerrit.wikimedia.org/r/359480 [15:01:31] (03PS3) 10Faidon Liambotis: wmflib, admin: fix RuboCop Style/For offenses [puppet] - 10https://gerrit.wikimedia.org/r/359481 [15:01:33] 10Operations, 10ops-eqiad, 10DC-Ops: decommission mobile 1004 and mobile1005 - https://phabricator.wikimedia.org/T181750#3831183 (10Cmjohnson) p:05Triage>03Low [15:01:33] (03PS3) 10Faidon Liambotis: base: fix RuboCop MethodCallWithoutArgsParentheses [puppet] - 10https://gerrit.wikimedia.org/r/359482 [15:01:39] (03PS3) 
10Faidon Liambotis: utils/expanderrb.rb: fix Style/SpecialGlobalVars [puppet] - 10https://gerrit.wikimedia.org/r/359483 [15:01:41] (03PS3) 10Faidon Liambotis: Fix Style/NegatedIf RuboCop offense across the tree [puppet] - 10https://gerrit.wikimedia.org/r/359484 [15:01:43] (03PS3) 10Faidon Liambotis: rubocop: move three ignores to .rubocop.yml [puppet] - 10https://gerrit.wikimedia.org/r/359485 [15:02:51] !log clear recdns records related to notebook1002/kafka1023 (rec_control wipe-cache kafka1023.eqiad.wmnet kafka1023.mgmt.eqiad.wmnet notebook1002.eqiad.wmnet 14.5.64.10.in-addr.arpa 104.3.65.10.in-addr.arpa) - T181518 [15:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:02] T181518: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518 [15:03:08] just ran this on hydrogen, looks good --^ [15:03:18] (wiped 14 records, 1 negative records, 10 packets) [15:03:55] (03PS2) 10Rush: openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [15:04:10] (03PS3) 10Rush: openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [15:05:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:06:12] (03CR) 10BBlack: [C: 031] varnish: Don't redirect www.$project.org on mobile [puppet] - 10https://gerrit.wikimedia.org/r/394902 (https://phabricator.wikimedia.org/T154026) (owner: 10EddieGP) [15:06:30] !log Upgrade MySQl on db1084 [15:06:32] (03PS4) 10Rush: openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:45] 
(03PS5) 10Rush: openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [15:07:37] (03PS2) 10Filippo Giunchedi: prometheus: add mtail to varnish jobs [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) [15:08:03] godog: yt? [15:08:13] q about multi DC graphite/statsd stuff [15:08:13] ottomata: yoyo [15:08:17] (03PS6) 10Rush: openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) [15:08:19] (03CR) 10Gehel: "Puppet compiler agrees this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/397836 (https://phabricator.wikimedia.org/T162241) (owner: 10Gehel) [15:08:24] so i'm working on https://phabricator.wikimedia.org/T179093#3765016 [15:08:30] getting statsv off of analytics kafka [15:08:32] onto main [15:08:34] for multi dc support [15:08:41] (and also so we can get clients off of analytics kafka) [15:08:41] :) [15:08:53] and, i'm still confused as to how multi DC graphite/statsd works [15:09:04] do all service produce to active DC? [15:09:12] e.g. services in codfw, ulsfo are producing to statsd in eqiad? [15:09:39] (03PS3) 10Elukey: Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) [15:09:47] ottomata: that's correct yeah, statsd traffic all flows to statsd.eqiad.wmnet [15:10:03] (03CR) 10Rush: [C: 032] openstack: first install control node dependency issues [puppet] - 10https://gerrit.wikimedia.org/r/397835 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:10:30] and when active DC is changed to codfw, all traffic flows to statsd.codfw.wmnet? [15:10:34] what about graphite? [15:10:35] same? 
[15:10:46] (03PS4) 10Elukey: Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) [15:10:48] does statsd in codfw only produce to codfw graphite? [15:10:48] (03CR) 10Ema: [C: 031] prometheus: add mtail to varnish jobs [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [15:10:55] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397839 [15:11:14] for graphite it's the same deal as statsd, everything sends to graphite-in.eqiad.wmnet [15:11:17] (03PS2) 10Gehel: maps: extract style used by kartotherian / tilerator as parameter [puppet] - 10https://gerrit.wikimedia.org/r/397836 (https://phabricator.wikimedia.org/T162241) [15:11:38] upon switchover we flip the dns records to point to codfw and producers move over [15:12:06] ok [15:12:15] so does that mean that after a switchover to codfw [15:12:18] not a great system since it is dns based and some producers cache the records forever, but meh [15:12:23] graphite is mostly blank? [15:12:31] historical stuff is not there? [15:12:36] (03CR) 10Gehel: [C: 032] maps: extract style used by kartotherian / tilerator as parameter [puppet] - 10https://gerrit.wikimedia.org/r/397836 (https://phabricator.wikimedia.org/T162241) (owner: 10Gehel) [15:12:38] since graphite.wikimedia.org will point at the new graphite instance? [15:12:42] (new == codfw) [15:12:49] no, eqiad graphite also mirrors its metrics to codfw [15:12:52] so the historical stuff is there [15:12:55] ok [15:13:01] and vice versa? [15:13:05] codfw mirrors to eqiad? [15:13:10] master master kinda?
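A minimal sketch of the DNS caveat mentioned above ("some producers cache the records forever"), with no real network involved. `FakeDNS`, `CachingSender`, `ResolvingSender` and the two target host labels are hypothetical names made up for illustration; only `statsd.eqiad.wmnet` comes from the conversation.

```python
class FakeDNS:
    """Toy resolver: maps a service name to whatever host it points at now."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, name):
        return self.records[name]


class CachingSender:
    """Producer that resolves the name once at startup and never again."""

    def __init__(self, dns, name):
        self.target = dns.resolve(name)  # cached for the process lifetime

    def send(self, metric):
        return (self.target, metric)


class ResolvingSender:
    """Producer that looks the name up again on every send."""

    def __init__(self, dns, name):
        self.dns = dns
        self.name = name

    def send(self, metric):
        return (self.dns.resolve(self.name), metric)


# Hypothetical host labels; the real targets are not given in the log.
dns = FakeDNS({"statsd.eqiad.wmnet": "host-in-eqiad"})
cached = CachingSender(dns, "statsd.eqiad.wmnet")
fresh = ResolvingSender(dns, "statsd.eqiad.wmnet")

# Switchover: the record is flipped to point at codfw.
dns.records["statsd.eqiad.wmnet"] = "host-in-codfw"

cached_result = cached.send("example.metric:1|c")
fresh_result = fresh.send("example.metric:1|c")
```

In this toy model the caching producer keeps sending to the old target until it is restarted, which is the weakness of a purely DNS-based switchover described in the conversation.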
[15:13:23] (03Abandoned) 10Hashar: graphite: cleanup servers.* [puppet] - 10https://gerrit.wikimedia.org/r/377414 (owner: 10Hashar) [15:13:48] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused [15:14:13] ottomata: yes if you send data to graphite codfw it'll end up in eqiad too [15:15:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397839 (owner: 10Marostegui) [15:15:37] godog: hmmMMmm i'm still having trouble combining this with statsv/kafka. do you have a min for a hangout? [15:15:58] (03CR) 10Ottomata: [C: 031] Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [15:16:00] (03CR) 10Volans: [C: 031] "LGTM, please also double check that hiera regexes are correctly including or not including it where needed." [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [15:16:10] ottomata: not now sorry [15:16:13] (03CR) 10Elukey: [C: 032] Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [15:16:21] ok [15:16:26] i'll keep typing then :p [15:16:28] (03PS5) 10Elukey: Rename notebook1002 to kafka1023 [puppet] - 10https://gerrit.wikimedia.org/r/397534 (https://phabricator.wikimedia.org/T181518) [15:16:28] got +2 sniped three times in a row :D [15:16:38] so main kafka exists in both eqiad and codfw [15:16:43] the way we do multi dc for things like changeprop, etc. [15:16:44] is [15:17:04] producers in codfw produce to codfw kafka with codfw.
prefixed topics (and vice versa for eqiad) [15:17:16] then, codfw prefixed topics are mirrored to main kafka eqiad [15:17:21] and vice versa [15:17:30] then consumers in both DCs consume from all prefixed topics [15:17:55] all consumers in codfw get all messages from both eqiad and codfw, but while eqiad is active DC, the codfw prefixed topics will be mostly empty [15:18:12] this is different for statsd, since producers in codfw produce to statsd in eqiad... [15:18:18] and the replication/mirroring is handled by graphite... [15:18:20] hmm. [15:19:13] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397839 (owner: 10Marostegui) [15:19:43] (03CR) 10Filippo Giunchedi: [C: 031] "PCC https://puppet-compiler.wmflabs.org/compiler03/9299/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [15:19:53] (03PS3) 10Filippo Giunchedi: prometheus: add mtail to varnish jobs [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) [15:20:27] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add mtail to varnish jobs [puppet] - 10https://gerrit.wikimedia.org/r/397774 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [15:20:37] herron: o/ - do you need puppet disabled on install1002 [15:20:39] ? [15:20:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 with low weight (duration: 01m 18s) [15:20:58] elukey nope just finished! 
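The prefixed-topic scheme described at 15:17 can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `Cluster`, `mirror` and `consume_all` are hypothetical helpers, the topic name `change` is invented, and no real Kafka client or mirroring tool is used.

```python
from collections import defaultdict

DATACENTERS = ("eqiad", "codfw")


class Cluster:
    """Toy stand-in for one datacenter's main Kafka cluster (hypothetical)."""

    def __init__(self, dc):
        self.dc = dc
        self.topics = defaultdict(list)  # topic name -> list of messages

    def produce(self, topic, message):
        # Local producers only write to the topic prefixed with their own DC.
        self.topics[f"{self.dc}.{topic}"].append(message)


def mirror(source, target):
    """Copy the source DC's own prefixed topics into the target cluster."""
    prefix = f"{source.dc}."
    for name, messages in source.topics.items():
        if name.startswith(prefix):
            target.topics[name] = list(messages)


def consume_all(cluster, topic):
    """Read one logical topic from every DC-prefixed variant on a cluster."""
    out = []
    for dc in DATACENTERS:
        out.extend(cluster.topics.get(f"{dc}.{topic}", []))
    return out


eqiad, codfw = Cluster("eqiad"), Cluster("codfw")
eqiad.produce("change", "edit-1")
eqiad.produce("change", "edit-2")
codfw.produce("change", "edit-3")

# Each cluster mirrors only its own prefixed topics to the other one.
mirror(eqiad, codfw)
mirror(codfw, eqiad)

seen_in_eqiad = consume_all(eqiad, "change")
seen_in_codfw = consume_all(codfw, "change")
```

Because each side mirrors only its own prefix, messages are never copied back in a loop, and a consumer in either datacenter sees the same combined stream. The model assumes producers always write to their local cluster, which is the assumption statsd breaks by having codfw producers send to eqiad.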
[15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] re-enabled [15:21:45] (03PS1) 10Cmjohnson: Removing decom'd server eventlog2001 from site.pp and dhcpd file T182397 [puppet] - 10https://gerrit.wikimedia.org/r/397842 [15:22:25] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397814 (owner: 10Marostegui) [15:22:54] herron: thankssss [15:23:26] herron: would you like me to make that reprepro change now? And do we need similar things for stretch and trusty? [15:23:53] andrewbogott it's done for stretch and jessie [15:24:11] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397843 (https://phabricator.wikimedia.org/T174569) [15:24:18] Oh, so I see! I just needed to do an apt-get update [15:24:25] herron: want to do Trusty too while you're at it? [15:24:31] (03CR) 10jenkins-bot: Add NS aliases for zh_wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393835 (https://phabricator.wikimedia.org/T181374) (owner: 10Urbanecm) [15:24:42] let's see [15:24:45] !log rename notebook1002 -> kafka1023 - step 3, replace notebook1002 with kafka1023 in the puppet config [15:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:17] (03PS1) 10Andrew Bogott: Revert "labs-bootstrapvz-jessie: include puppet pinning in base images" [puppet] - 10https://gerrit.wikimedia.org/r/397844 [15:25:30] ottomata: from my reading of the ticket, it seems that all kafka consumers of statsv topics will need to push to a single statsd endpoint, the consumers themselves could be either all in one dc or two [15:25:36] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1086 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397819 (owner: 10Marostegui) [15:25:40] !log powercycling maps-test2003 [15:25:46] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2071" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/397822 (owner: 10Jcrespo) [15:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] (03CR) 10jenkins-bot: mariadb: Depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397828 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [15:25:59] I'll leave trusty alone for now, unless you know of somewhere to easily pull puppet 4 packages from? [15:26:01] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397839 (owner: 10Marostegui) [15:26:38] yeah, but now i'm realizing also that this is different than the other main kafka multi dc uses, since there are producers not in the active DC [15:26:41] (03PS2) 10Cmjohnson: Removing decom'd server eventlog2001 from site.pp and dhcpd file T182397 [puppet] - 10https://gerrit.wikimedia.org/r/397842 [15:26:42] so prefixed topics don't make sense [15:26:43] (03CR) 10Andrew Bogott: [C: 032] Revert "labs-bootstrapvz-jessie: include puppet pinning in base images" [puppet] - 10https://gerrit.wikimedia.org/r/397844 (owner: 10Andrew Bogott) [15:26:48] hm [15:26:51] (03PS2) 10Andrew Bogott: Revert "labs-bootstrapvz-jessie: include puppet pinning in base images" [puppet] - 10https://gerrit.wikimedia.org/r/397844 [15:27:02] hm, or maybe they do, but i'd need to concept of a 'master' kafka cluster [15:27:06] which doesn't really exist. 
[15:27:14] (03CR) 10jenkins-bot: Lift account registration on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397795 (https://phabricator.wikimedia.org/T182665) (owner: 10Urbanecm) [15:27:20] i could pick one by looking up the current master dc [15:27:28] (03CR) 10Cmjohnson: [C: 032] Removing decom'd server eventlog2001 from site.pp and dhcpd file T182397 [puppet] - 10https://gerrit.wikimedia.org/r/397842 (owner: 10Cmjohnson) [15:27:30] but then the messages wouldn't switch over until puppet runs and bounces varnishkafka [15:27:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397843 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:28:01] errr i need brain bounce, elukey you got a few mins? :D [15:28:13] (03PS3) 10Andrew Bogott: Revert "labs-bootstrapvz-jessie: include puppet pinning in base images" [puppet] - 10https://gerrit.wikimedia.org/r/397844 [15:28:33] herron: no, we have to just backport those packages ourselves [15:28:59] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3792843 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` notebook1002.eqiad.wmnet ``` The log...
[15:29:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397843 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:30:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397843 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:31:05] !log Deploy schema change on s4 db1097:3314 - T174569 [15:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:16] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [15:31:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 - T174569 (duration: 01m 01s) [15:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:36] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2072 (duration: 00m 56s) [15:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:24] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3599592 (10Imarlier) @aaron - see note from Jaime above, he's waiting on answers fro... [15:35:58] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831316 (10akosiaris) Pardon me, but I have to ask why a file with timestamps in the log file dating `Dec 11th`, and with a local... 
[15:36:36] (03PS1) 10Gehel: maps: update style to match latest kartotherian version [puppet] - 10https://gerrit.wikimedia.org/r/397846 (https://phabricator.wikimedia.org/T162241) [15:38:22] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831321 (10awight) >>! In T181661#3831316, @akosiaris wrote: > Pardon me, but I have to ask why a file with timestamps in the log... [15:39:23] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831323 (10akosiaris) >>! In T181661#3831321, @awight wrote: >>>! In T181661#3831316, @akosiaris wrote: >> Pardon me, but I have t... [15:39:36] (03PS1) 10Cmjohnson: Removing productin DNS only of eventlog2001 mgmt should stay until unracked T182397 [dns] - 10https://gerrit.wikimedia.org/r/397848 [15:39:53] (03CR) 10Cmjohnson: [C: 032] Removing productin DNS only of eventlog2001 mgmt should stay until unracked T182397 [dns] - 10https://gerrit.wikimedia.org/r/397848 (owner: 10Cmjohnson) [15:40:07] (03PS2) 10Cmjohnson: Removing productin DNS only of eventlog2001 mgmt should stay until unracked T182397 [dns] - 10https://gerrit.wikimedia.org/r/397848 [15:40:17] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PDNS recursor (032 comments) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [15:41:44] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops, 10Patch-For-Review: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3831340 (10Cmjohnson) [15:42:16] 10Operations, 10ops-codfw, 10Analytics, 10DC-Ops, 10Patch-For-Review: Decomission eventlog2001 - https://phabricator.wikimedia.org/T182397#3822380 (10Cmjohnson) Switch port is ge-5/0/9 labeled eventlog2001-decommed @papaul all yours [15:43:28] herron: I'm getting complaints 
about 'ruby-deep-merge' — I think that probably also needs to be moved over [15:43:55] I'm not sure why? we had only pinned puppet* [15:44:00] hm, well, hang on let me make sure [15:44:10] (03CR) 10Thcipriani: "recheck" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [15:44:27] herron: yeah, puppet depends on it [15:44:40] (03CR) 10jerkins-bot: [V: 04-1] Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [15:44:50] this is for a VM base image with somewhat different repos available [15:44:54] yeah, ruby-deep-merge isn't available on jessie, just jessie-backports [15:44:57] per https://packages.debian.org/search?keywords=ruby-deep-merge [15:45:04] (or "rmadison ruby-deep-merge") [15:45:08] (03CR) 10Filippo Giunchedi: [C: 031] Add a prometheus exporter for ircd [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [15:45:20] so the dependency was satisfied only from jessie-backports, thus no pin needed [15:46:01] !log stop, upgrade and reboot db2072 [15:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:20] hm, maybe I should just add backports to this base image [15:47:27] (03PS2) 10Gehel: maps: update style to match latest kartotherian version [puppet] - 10https://gerrit.wikimedia.org/r/397846 (https://phabricator.wikimedia.org/T162241) [15:47:50] yeah, I can do that, I just have things happening out of order here [15:49:38] (03PS2) 10Thcipriani: Restrict setup.py to python 3.4 or later [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397748 (owner: 10Hashar) [15:50:53] (03PS7) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [15:50:55] (03PS1) 10Andrew Bogott: bootstrap-vz: include debian backports in sources.list [puppet] - 
10https://gerrit.wikimedia.org/r/397850 (https://phabricator.wikimedia.org/T178717) [15:51:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831379 (10Cmjohnson) [15:51:59] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: include debian backports in sources.list [puppet] - 10https://gerrit.wikimedia.org/r/397850 (https://phabricator.wikimedia.org/T178717) (owner: 10Andrew Bogott) [15:52:59] (03PS1) 10Filippo Giunchedi: prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) [15:53:15] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [15:53:42] (03PS2) 10Filippo Giunchedi: prometheus: add mtail to varnish-upload job [puppet] - 10https://gerrit.wikimedia.org/r/397851 (https://phabricator.wikimedia.org/T177199) [15:55:50] 10Operations, 10Datasets-General-or-Unknown, 10monitoring, 10Patch-For-Review: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3831390 (10ArielGlenn) No dumps generation jobs run on the dataset host any more. This host only does rsyncs and we... [15:57:11] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831395 (10Cmjohnson) assigning to @Marostegui for installs [15:57:21] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397853 [15:58:32] PROBLEM - Host boron is DOWN: PING CRITICAL - Packet loss = 100% [15:58:39] herron: is there any long term downside to just leaving Trusty agents on 3.x? 
[15:58:58] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [15:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:08] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 09s) [15:59:08] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [15:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:18] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [15:59:22] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 04s) [15:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397853 (owner: 10Marostegui) [16:00:52] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831411 (10thcipriani) >>! In T181661#3831316, @akosiaris wrote: > Now for the more interesting stuff. I 've tried to run `/usr/bi... [16:01:01] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 4 minutes ago with 9 failures. 
Failed resources (up to 3 shown): Service[drbd],Service[nfs-kernel-server],Service[puppet],Service[nscd] [16:01:30] andrewbogott that's what I was thinking since our masters have the rack middleware to support 3.x agents and we would otherwise have to maintain our own puppet 4 packages for trusty [16:01:47] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397853 (owner: 10Marostegui) [16:01:58] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397853 (owner: 10Marostegui) [16:02:32] herron: ok — if there aren't any ready-made packages then I'm fine with that if it doesn't incur limitations on what we can put in puppet manifests [16:03:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1084 (duration: 00m 56s) [16:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:31] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 19 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Package[nova-api],Package[nova-consoleauth],Package[nova-spiceproxy],Package[websockify] [16:06:39] (03PS1) 10DCausse: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) [16:07:38] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [16:08:31] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:09:46] (03PS2) 10DCausse: [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) [16:11:30] (03PS1) 10Cmjohnson: Removing site.pp entries and dhcpd file entries for decom'd db1015,21,db104[4-9]50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 [16:12:06] (03CR) 10jerkins-bot: [V: 04-1] Removing site.pp entries and dhcpd file entries for decom'd db1015,21,db104[4-9]50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 (owner: 10Cmjohnson) [16:12:31] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PDNS recursor (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [16:13:36] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397857 [16:13:40] (03PS2) 10Cmjohnson: Removing site.pp entries and dhcpd file entries for decom'd db1015,21,46-49,50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 [16:14:15] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3831476 (10akosiaris) ganeti1008 a few hours ago. This went largely unnoticed as icinga did not spew any alerts. 
This time the event has lasted way longer ``` akosiaris@ganeti100... [16:16:02] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3769627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [16:16:08] moritzm: around? [16:17:14] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2072 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397858 [16:18:11] RECOVERY - Host boron is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [16:20:59] volans: yep [16:21:34] moritzm: so we're trying with elukey to reimage notebook1002 into kafka1023 and there was no pending puppet cert to sign [16:21:44] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2072 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397858 (owner: 10Jcrespo) [16:21:47] !log failover boron to ganeti1008 [16:21:50] I manually run puppet agent --test and it generated it and it appeared on the puppetmaster [16:21:54] what on earth.... [16:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:05] (03CR) 10Marostegui: [C: 031] Removing site.pp entries and dhcpd file entries for decom'd db1015,21,46-49,50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 (owner: 10Cmjohnson) [16:22:07] (03PS3) 10Gehel: maps: update style to match latest kartotherian version [puppet] - 10https://gerrit.wikimedia.org/r/397846 (https://phabricator.wikimedia.org/T162241) [16:22:12] I'm wondering if something has changed, given the new image of today [16:22:14] volans: that worked in my earlier reimage of mw1260, though [16:22:23] yeah asking you exactly for this [16:22:34] maybe rather a problem in the code paths handling the rename? 
[16:22:38] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [16:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:47] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [16:22:48] I reimaged to the previous name, only with a new OS [16:22:53] (03CR) 10Gehel: [C: 032] maps: update style to match latest kartotherian version [puppet] - 10https://gerrit.wikimedia.org/r/397846 (https://phabricator.wikimedia.org/T162241) (owner: 10Gehel) [16:23:01] but the puppet cert generation should be part of d-i [16:23:04] (03PS3) 10Cmjohnson: Removing site.pp entries and dhcpd file entries for decom'd db1015,21,46-49,50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 [16:23:06] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2072 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397858 (owner: 10Jcrespo) [16:23:09] and has nothing to do with the rename, right? [16:23:23] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2072 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397858 (owner: 10Jcrespo) [16:23:45] (03CR) 10Cmjohnson: [C: 032] Removing site.pp entries and dhcpd file entries for decom'd db1015,21,46-49,50 [puppet] - 10https://gerrit.wikimedia.org/r/397856 (owner: 10Cmjohnson) [16:24:11] (03PS2) 10Marostegui: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397857 [16:24:39] volans: not sure, but I think it's safe to rule out the image refresh, that only touches the d-i base stuff (basically adding the non-free firmware to the stock Debian image) [16:24:46] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2072 (duration: 00m 55s) [16:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:18] * volans wonders if it's a race condition [16:25:38] wouldn't be surprised, reimaging under a new name is a rarely exercised code path I guess
[16:25:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397857 (owner: 10Marostegui) [16:26:19] there is no difference in the code at the line waiting for the appearance of the puppet cert on puppetmaster though [16:26:24] the rename stuff is earlier [16:27:18] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397857 (owner: 10Marostegui) [16:27:28] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397857 (owner: 10Marostegui) [16:28:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1084 (duration: 00m 53s) [16:28:41] !log gehel@tin Started deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2002 [16:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:59] !log gehel@tin Finished deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2002 (duration: 00m 18s) [16:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:18] (03CR) 10Jon Harald Søby: "@Steinsplitter: I can't see why it would be?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397129 (https://phabricator.wikimedia.org/T182534) (owner: 10Jon Harald Søby) [16:30:33] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging on maps-test2002 [16:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:52] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging on maps-test2002 (duration: 00m 20s) [16:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:27] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled] [16:32:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:32:55] 503 high [16:34:29] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging on maps-test2001 [16:34:37] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled] [16:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:44] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831544 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [16:34:49] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging on maps-test2001 (duration: 00m 20s) [16:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:03] jynus: looks recovered already :( [16:35:07] weird, seems all ulsfo [16:35:24] !log gehel@tin Started deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2001 [16:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:37] but what was it, upload, or the errors upload related? 
[16:35:42] !log gehel@tin Finished deploy [kartotherian/deploy@6e223df]: new kartotherian packaging on maps-test2001 (duration: 00m 19s) [16:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:56] no it was for text ulsfo as elukey mentioned [16:36:11] (03PS3) 10EddieGP: Delete mowiki and mowiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923) (owner: 10MarcoAurelio) [16:38:11] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... [16:39:37] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:40:13] (03PS1) 10Marostegui: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397863 [16:40:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:40:37] !log restart and upgrade db1059 (phabricator passive db) [16:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:27] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:42:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397863 (owner: 10Marostegui) [16:43:11] (03Draft1) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [16:43:14] (03PS2) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [16:43:44] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Make ipv6 support truly optional [puppet] - 
10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [16:43:46] proxies complaining is normal [16:43:54] see log above [16:44:06] will go back to normal when restart finishes [16:44:19] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397863 (owner: 10Marostegui) [16:44:47] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:45:01] (03PS3) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [16:45:07] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:45:16] (03PS4) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [16:45:38] (03PS5) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [16:45:44] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [16:46:58] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397863 (owner: 10Marostegui) [16:47:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1084 and db1081 original weight (duration: 00m 56s) [16:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:41] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831598 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [16:51:47] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [16:52:07] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [16:53:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: 
Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 [16:54:01] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 [16:54:08] (03CR) 10Marostegui: [C: 04-2] "Wait for the alter to finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397869 (owner: 10Marostegui) [16:54:24] (03PS1) 10Cmjohnson: Removing dns entries for decom db's db1015,21,26,44-49,50 [dns] - 10https://gerrit.wikimedia.org/r/397870 [16:55:59] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831602 (10akosiaris) `akosiaris@tin:/srv/deployment/ores/deploy$ scap deploy -v -l 'ores1004.eqiad.wmnet' T181661` fails reprodu... [16:56:27] (03PS4) 10Ayounsi: [WIP] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [16:56:57] (03CR) 10Marostegui: [C: 031] Removing dns entries for decom db's db1015,21,26,44-49,50 [dns] - 10https://gerrit.wikimedia.org/r/397870 (owner: 10Cmjohnson) [16:59:44] PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:55] (03PS2) 10Cmjohnson: Removing dns entries for decom db's db1015,21,26,44-49,50 [dns] - 10https://gerrit.wikimedia.org/r/397870 [17:00:04] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T1700). [17:00:04] no_justification and tgr: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:15] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:00:25] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:35] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:48] Er, was that carried over from a prior day? [17:00:51] Mine was already merged [17:00:59] mine wasn't [17:01:14] (03PS4) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [17:01:24] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:25] PROBLEM - DPKG on db1111 is CRITICAL: Return code of 255 is out of bounds [17:01:30] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom db's db1015,21,26,44-49,50 [dns] - 10https://gerrit.wikimedia.org/r/397870 (owner: 10Cmjohnson) [17:01:35] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:44] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:13] tgr: Possible I also just put it in the wrong spot or something [17:02:25] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:33] puppetdb barfed [17:02:38] that explains all the above ^ [17:02:44] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:54] puppet failures are due to nitrogen [17:02:57] 502s [17:03:00] yeah [17:03:05] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:03:11] OOM came out over there as well [17:03:16] PROBLEM - Disk space on db1111 is CRITICAL: Return code of 255 is out of bounds [17:03:21] puppetdb restarted 5m ago [17:03:24] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:29] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [17:03:32] (03PS1) 10Chad: group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397873 [17:03:34] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:34] (03CR) 10Chad: [C: 04-2] group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397873 (owner: 10Chad) [17:03:34] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:40] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [17:04:14] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:17] (03CR) 10Alexandros Kosiaris: Add postgresql::prometheus class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392438 (https://phabricator.wikimedia.org/T177196) (owner: 10Alexandros Kosiaris) [17:04:34] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:44] akosiaris: I'd like to test https://gerrit.wikimedia.org/r/#/c/394966, for the moment I've only played with Riccardo's puppetdb instance but I'd prefer something else. Suggestions? [17:04:49] maybe deployment-prep?
[17:04:52] (03PS1) 10Rush: openstack: contain relationship for needed classes [puppet] - 10https://gerrit.wikimedia.org/r/397874 (https://phabricator.wikimedia.org/T171494) [17:04:54] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:15] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:15] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:05:43] elukey: I got nothing [17:06:01] I don't even know if deployment-prep has puppetdb these days [17:06:03] (03Draft1) 10Paladox: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 [17:06:07] (03PS2) 10Paladox: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 [17:06:24] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:08:01] going also to ping herron (if you have time, https://gerrit.wikimedia.org/r/#/c/394966) [17:08:06] also where to test it properly [17:09:30] (03PS1) 10Ema: varnishxcps.mtail: use prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/397876 (https://phabricator.wikimedia.org/T177199) [17:09:52] (03CR) 10Alexandros Kosiaris: [C: 031] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [17:10:04] PROBLEM - configured eth on db1111 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:11:14] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:27] T181661: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661 [17:11:44] PROBLEM - dhclient process on db1111 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:11:50] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 36s) [17:11:59] !log akosiaris@tin Started deploy [ores/deploy@b4f2b02]: T181661 [17:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:02] !log akosiaris@tin Finished deploy [ores/deploy@b4f2b02]: T181661 (duration: 00m 03s) [17:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:34] (03PS2) 10Rush: openstack: contain relationship for needed classes [puppet] - 10https://gerrit.wikimedia.org/r/397874 (https://phabricator.wikimedia.org/T171494) [17:13:34] PROBLEM - puppet last run on db1111 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:13:42] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831654 (10akosiaris) >>! In T181661#3831411, @thcipriani wrote: >>>! In T181661#3831316, @akosiaris wrote: >> Now for the more in... 
[17:15:01] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.7 (duration: 08m 45s) [17:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:24] RECOVERY - Host ganeti1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:19:16] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3831667 (10akosiaris) Moving the `deploy-cache/cache` directory aside did solve the issue (partially?) and moved on until... ```... [17:19:46] thcipriani: nice idea about the deploy-cache/cache directory. Unfortunately no dice yet... [17:20:52] (03CR) 10Dzahn: gerrit: Fix ipv6 in gerrit.config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397875 (owner: 10Paladox) [17:20:59] (03PS1) 10Andrew Bogott: k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [17:21:21] (03CR) 10jerkins-bot: [V: 04-1] k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [17:21:58] (03CR) 10Dzahn: "not sure about the best solution yet but hiera lookups in parameters are not supposed to have defaults (per style)" [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [17:22:24] PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:35] expected ^ not to worry [17:22:42] ok, cool [17:22:49] (03PS2) 10Andrew Bogott: k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [17:22:51] !log demon@tin Started scap: bootstrap wmf.12 [17:22:53] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/9303/" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [17:22:55] RECOVERY - Host ganeti1006 is UP: PING OK - 
Packet loss = 0%, RTA = 0.38 ms [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:06] akosiaris: yeah, looking at the logs I'm not sure why it's still failing. It fetched down from tin to the cache dir so the commit should be there... [17:23:12] (03CR) 10jerkins-bot: [V: 04-1] k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [17:24:19] (03CR) 10Dzahn: "wouldn't this be another "if @ipv6" in an erb template. it probably needs to puppetize Apache's ports.conf" [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [17:24:23] (03PS8) 10Elukey: role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [17:24:49] (03PS4) 10Dzahn: wmcs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/394625 [17:24:52] (03CR) 10Elukey: "just did a s/Xmx=4G/Xmx4g in puppetdb::app" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [17:25:42] (03PS3) 10Paladox: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 [17:25:44] (03CR) 10Paladox: gerrit: Fix ipv6 in gerrit.config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397875 (owner: 10Paladox) [17:26:07] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831686 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei... 
[17:27:09] thcipriani: funny thing is git fetch origin on ores1004 doesn't really bring the object either [17:27:17] so it's not a interrupted fetch [17:27:19] deploy-service@ores1004:/srv/deployment/ores/deploy-cache/cache/editquality$ git fetch origin [17:27:19] deploy-service@ores1004:/srv/deployment/ores/deploy-cache/cache/editquality$ git show 15d5283b7422919d85203b5ba907027f9356e421 [17:27:19] fatal: bad object 15d5283b7422919d85203b5ba907027f9356e421 [17:27:44] PROBLEM - Host ganeti1006 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:57] what's the remove for that submodule in the .git/config? [17:28:01] er remote [17:28:17] and does it match .gitmodules [17:28:29] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:29:09] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:29] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:45] 10Operations: Use firmware-enriched Debian installation images - https://phabricator.wikimedia.org/T182699#3831688 (10MoritzMuehlenhoff) [17:29:58] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:30:18] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:30:29] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:35] deploy-service@ores1004:/srv/deployment/ores/deploy-cache/cache/editquality$ git remote -v [17:30:35] origin http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality (fetch) [17:30:35] origin http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality (push) [17:30:39] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/tftpboot/jessie-installer/debian-installer/amd64/grub/x86_64-efi/mmap.mod] [17:30:39] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:31:26] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3831700 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` kafka1023.eqiad.wmnet ``` The log can... [17:31:30] thcipriani: yeah .git/config and .gitmodules are a match [17:31:36] and it's http://tin.eqiad.wmnet/ores/deploy/.git/modules/submodules/editquality [17:31:48] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:48] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:49] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:58] yeah, which I just checked has the commit that's failing [17:32:06] (03Draft1) 10Paladox: interface: Support false value alongside undef [puppet] - 10https://gerrit.wikimedia.org/r/397882 [17:32:11] that's really weird [17:32:12] (03PS2) 10Paladox: interface: Support false value alongside undef [puppet] - 10https://gerrit.wikimedia.org/r/397882 [17:32:28] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:32:39] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:32:48] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 370.06 seconds [17:33:28] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 4 
minutes ago with 0 failures [17:33:28] RECOVERY - Host ganeti1006 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [17:33:39] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:35:54] did we have a deployment or maintenance ongoing? [17:35:59] issues on s7 replication [17:36:14] if something is ongoing, stop it [17:36:21] (03CR) 10Herron: [C: 031] "Looks good! https://puppet-compiler.wmflabs.org/compiler02/9304/" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [17:36:22] akosiaris: hrm a lot of the submodule code for the cache directory has changed since I last dug in here, twentyafterfour is probably more in the know than I am about it; however, I can't explain the behavior even now that I'm reading the code. [17:37:05] lol [17:37:21] ok, let's wait for twentyafterfour to chime in [17:37:27] maybe he has some nice insight [17:38:14] The master is showing a lot more activity than usual: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1062&var-port=9104&from=now-3h&to=now [17:38:26] heavy inserts since 17:21 or so [17:38:40] (03PS1) 10Elukey: Remove any trace of notebook1002 records [dns] - 10https://gerrit.wikimedia.org/r/397884 (https://phabricator.wikimedia.org/T181518) [17:38:48] let's check from what [17:38:50] • 17:22 demon@tin: Started scap: bootstrap wmf.12 [17:38:50] • 17:15 demon@tin: Pruned MediaWiki: 1.31.0-wmf.7 (duration: 08m 45s) [17:39:01] That is from SAL [17:39:52] no_justification ^ [17:40:04] It's only on testwiki [17:40:17] And not even done, so wikiversions hasn't recompiled [17:40:20] this is s7, not s3 [17:40:23] so not related [17:40:40] seems to be going down [17:40:47] let's check the actual queries [17:41:08] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.34 seconds [17:41:14] yes, it is going back to more normal values [17:41:29] (03CR)
10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9305/" [puppet] - 10https://gerrit.wikimedia.org/r/394625 (owner: 10Dzahn) [17:42:58] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3831730 (10Cmjohnson) [17:43:37] templatelinks or categorylinks on eswiki? [17:43:46] (03PS3) 10Andrew Bogott: k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [17:44:12] (03CR) 10jerkins-bot: [V: 04-1] k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [17:44:36] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3831735 (10Cmjohnson) [17:45:04] marostegui: it could be https://es.wikipedia.org/w/index.php?title=Plantilla:T%C3%ADtulo_sin_coletilla&action=history [17:45:18] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3831736 (10Cmjohnson) [17:45:57] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3831739 (10Cmjohnson) [17:46:01] and/or http://en.wikipedia.org/wiki/Special:Search?go=Go&search=:es:Wikipedia:Páginas_con_bucles_de_plantillas [17:46:06] (03PS4) 10Andrew Bogott: k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [17:46:23] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3831741 (10Cmjohnson) [17:46:33] (03CR) 10jerkins-bot: [V: 04-1] k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 
(https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [17:46:37] Yeah, from the binlogs it looks like templatelinks or categorylinks indeed [17:46:49] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [17:47:01] but that is bad, that shouldn't happen [17:47:24] I am going to file a ticket [17:47:31] yes, it is weird [17:47:39] (03PS2) 10Dzahn: planet: drop duplicate standard include [puppet] - 10https://gerrit.wikimedia.org/r/397734 [17:47:53] I think it didn't set wikis in read only [17:48:01] (03CR) 10Dzahn: [C: 032] planet: drop duplicate standard include [puppet] - 10https://gerrit.wikimedia.org/r/397734 (owner: 10Dzahn) [17:48:01] so maybe the largest dbs survived [17:48:09] (03PS3) 10Dzahn: planet: drop duplicate standard include [puppet] - 10https://gerrit.wikimedia.org/r/397734 [17:49:25] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator, 10hardware-requests: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3831747 (10Cmjohnson) All non-interruptible steps have been completed. 
Still needs wiping/removal from rack [17:49:55] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3831749 (10Cmjohnson) [17:50:01] (03Abandoned) 10Paladox: interface: Support false value alongside undef [puppet] - 10https://gerrit.wikimedia.org/r/397882 (owner: 10Paladox) [17:50:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10hardware-requests: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3802296 (10Cmjohnson) [17:51:13] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3798986 (10Cmjohnson) [17:52:27] !log demon@tin Finished scap: bootstrap wmf.12 (duration: 29m 35s) [17:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:48] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3831770 (10jcrespo) We believe that since s5 was accidentally migrated to ROW, the lag is improved; so it did on labsdbs despite not having any kind of replication control, unlike production. 
[17:53:54] (03CR) 10Rush: [C: 032] openstack: contain relationship for needed classes [puppet] - 10https://gerrit.wikimedia.org/r/397874 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:54:05] (03PS3) 10Rush: openstack: contain relationship for needed classes [puppet] - 10https://gerrit.wikimedia.org/r/397874 (https://phabricator.wikimedia.org/T171494) [17:55:38] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:57:03] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831781 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ``` [17:57:08] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 92.04 seconds [17:58:08] (03PS6) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [17:58:37] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [17:59:28] (03PS7) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [17:59:55] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:31] PROBLEM - Host db1111 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:22] (03PS3) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 [18:01:33] (03PS5) 10Andrew Bogott: tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [18:01:42] (03PS8) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [18:02:00] (03CR) 10jerkins-bot: [V: 04-1] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:02:04] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 (owner: 10Paladox) [18:02:29] (03PS6) 10Andrew Bogott: tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [18:02:36] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3831798 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka1023.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['kafka1023.eqiad.wmnet']... [18:02:58] coordinate parsoid-mcs-restbase deploy happening ... 
[18:03:00] (03CR) 10jerkins-bot: [V: 04-1] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:03:02] *coordinated [18:03:52] (03PS1) 10Filippo Giunchedi: WIP: rework mtail tests [puppet] - 10https://gerrit.wikimedia.org/r/397889 [18:05:20] (03CR) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor (032 comments) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [18:05:41] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add a prometheus exporter for ircd [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 (owner: 10Muehlenhoff) [18:06:05] (03PS7) 10Andrew Bogott: tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [18:06:32] (03CR) 10jerkins-bot: [V: 04-1] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:07:13] (03PS5) 10ArielGlenn: clean up all references to a 'public dumps dir' on web/nfs dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/397806 [18:07:45] (03CR) 10Smalyshev: [C: 031] [cirrus] tune wikidata similarity configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397855 (https://phabricator.wikimedia.org/T182293) (owner: 10DCausse) [18:08:16] (03PS2) 10BBlack: varnish: Don't redirect www.$project.org on mobile [puppet] - 10https://gerrit.wikimedia.org/r/394902 (https://phabricator.wikimedia.org/T154026) (owner: 10EddieGP) [18:09:24] (03CR) 10BBlack: [C: 032] varnish: Don't redirect www.$project.org on mobile [puppet] - 10https://gerrit.wikimedia.org/r/394902 (https://phabricator.wikimedia.org/T154026) (owner: 10EddieGP) [18:12:25] ACKNOWLEDGEMENT - Host db1111 is DOWN: PING CRITICAL - Packet loss = 100% Jcrespo reimage failed, 
to be fixed [18:13:35] (03PS1) 10Volans: wmf-auto-reimage: fix backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/397890 [18:17:58] 10Operations, 10Discovery-Search, 10Wikimedia-Logstash, 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051#3831827 (10debt) p:05High>03Low We'll table this work for now - until the new person that will take over Logstash... [18:21:48] !log uploaded prometheus-ircd-exporter to apt.wikimedia.org [18:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:08] (03CR) 10Rush: [C: 031] wmf-auto-reimage: fix backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/397890 (owner: 10Volans) [18:22:10] (03PS1) 10Chad: Releases jenkins: Only clone MediaWiki core, and make it a bare clone [puppet] - 10https://gerrit.wikimedia.org/r/397891 [18:23:09] (03PS2) 10Volans: wmf-auto-reimage: fix backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/397890 [18:24:04] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix backward compatibility [puppet] - 10https://gerrit.wikimedia.org/r/397890 (owner: 10Volans) [18:25:05] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3819484 (10debt) This appeared to be fairly easy at first, but turned out to be a big re-factoring amoun... [18:25:12] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3831859 (10debt) [18:29:19] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3831865 (10aaron) >>! 
In T175672#3778177, @jcrespo wrote: > @aaron the proxy is inst... [18:29:43] (03CR) 10Madhuvishy: [C: 031] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:29:49] !log running db query as phuser on phab database to get some numbers for T177423#3824056 [18:30:28] 10Operations: Debian Jessie reimage/install end up in kernel panic with 8.9 netboot image - https://phabricator.wikimedia.org/T182702#3831866 (10elukey) p:05Triage>03Normal [18:31:28] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.9 netboot image - https://phabricator.wikimedia.org/T182702#3831881 (10elukey) [18:32:01] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.9 netboot image - https://phabricator.wikimedia.org/T182702#3831866 (10elukey) [18:32:06] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, and 2 others: On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3831883 (10EddieGP) 05Open>03Resolved a:03EddieGP This is now d... [18:32:27] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3831887 (10MoritzMuehlenhoff) [18:33:42] (03CR) 10Andrew Bogott: [C: 032] "Ignoring the style check for now, as addressing it would involve a considerable refactor." 
[puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:33:54] (03PS8) 10Andrew Bogott: tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) [18:34:30] (03CR) 10jerkins-bot: [V: 04-1] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [18:37:27] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3831890 (10RobH) I've requested a return tag from OpenGear on our ticket # http://opengear.zendesk.com/hc/requests/16278 [18:46:56] (03PS1) 10Chad: Releases: Also include release tools for releasing [puppet] - 10https://gerrit.wikimedia.org/r/397895 [18:47:41] 10Operations, 10DBA, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3831913 (10jcrespo) > A local and foreign replica would do it is installed on both...
[18:48:03] (03PS9) 10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [18:48:17] !log aaron@tin Synchronized php-1.31.0-wmf.12/includes/Setup.php: 058c17e702eb0 (duration: 01m 09s) [18:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:29] (03PS4) 10Paladox: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 [18:59:34] 10Operations, 10Gerrit: gerrit's ipv6 param is failing on labs - https://phabricator.wikimedia.org/T182705#3831948 (10Paladox) [19:00:45] !log arlolra@tin Started deploy [parsoid/deploy@98139cb]: Updating Parsoid to 741fc5d [19:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:18] !log arlolra@tin Finished deploy [parsoid/deploy@98139cb]: Updating Parsoid to 741fc5d (duration: 05m 33s) [19:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:03] !log Parsoid deploy aborted and rolled back to 01c1fc3 while RESTBase fixes an issue [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:03] (03CR) 10Andrew Bogott: [V: 032 C: 032] tools k8s workers: add a mostly-permissive firewall [puppet] - 10https://gerrit.wikimedia.org/r/397879 (https://phabricator.wikimedia.org/T180055) (owner: 10Andrew Bogott) [19:22:48] 10Operations, 10Gerrit: gerrit's ipv6 param is failing on labs - https://phabricator.wikimedia.org/T182705#3832004 (10Dzahn) We talked about this. The problem here is manifold. Some facts: - IPv6 doesn't work in labs, this is causing more and more workarounds but yea, blocked by nova networking afaict - Gerri... 
[19:24:54] (03PS5) 10Paladox: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 (https://phabricator.wikimedia.org/T182705) [19:25:09] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3832012 (10awight) a:03awight [19:26:52] !log Running cleanupUsersWithNoId.php on all wikis (this will take a while), see T181731 [19:28:53] !log arlolra@tin Started deploy [parsoid/deploy@98139cb]: (no justification provided) [19:29:03] 10Operations, 10Puppet: puppetdb failures - https://phabricator.wikimedia.org/T178625#3832016 (10herron) 05Open>03Resolved a:03herron [19:34:31] (03PS6) 10Dzahn: gerrit: Fix ipv6 in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/397875 (https://phabricator.wikimedia.org/T182705) (owner: 10Paladox) [19:35:25] Welcome back, stashbot [19:35:26] !log Running cleanupUsersWithNoId.php on all wikis (this will take a while), see T181731 [19:35:35] (03PS7) 10Dzahn: gerrit: if @ipv6 is not set don't let gerrit-sshd listen on it [puppet] - 10https://gerrit.wikimedia.org/r/397875 (https://phabricator.wikimedia.org/T182705) (owner: 10Paladox) [19:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:38] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [19:36:20] (03CR) 10Dzahn: [C: 032] gerrit: if @ipv6 is not set don't let gerrit-sshd listen on it [puppet] - 10https://gerrit.wikimedia.org/r/397875 (https://phabricator.wikimedia.org/T182705) (owner: 10Paladox) [19:36:26] thanks :) [19:37:59] it adds a newline because of the "<%- " [19:38:03] but whatever :) [19:42:45] (03PS2) 10Dzahn: mwlog: style fixes, move firewall include [puppet] - 10https://gerrit.wikimedia.org/r/397701 [19:43:50] !log arlolra@tin Finished deploy [parsoid/deploy@98139cb]: (no justification 
provided) (duration: 14m 57s) [19:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:43] (03PS1) 10Rush: openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494) [19:46:55] (03PS2) 10Rush: openstack: contain classes for dependency handling [puppet] - 10https://gerrit.wikimedia.org/r/397903 (https://phabricator.wikimedia.org/T171494) [19:48:46] (03CR) 10Dzahn: [C: 032] "wmf-style: total violations delta -5" [puppet] - 10https://gerrit.wikimedia.org/r/397701 (owner: 10Dzahn) [19:50:23] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:23] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:23] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:23] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:23] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 
200) [19:50:23] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:43] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:50:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:51:02] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:51:02] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:51:03] on it ^ [19:51:03] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [19:51:03] (03PS10) 
10Paladox: gerrit: Make ipv6 support truly optional [puppet] - 10https://gerrit.wikimedia.org/r/397865 [19:51:21] mobileapps deployment underway [19:51:38] !log mholloway-shell@tin Started deploy [mobileapps/deploy@2690678]: Update mobileapps to 5b8796d [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:40] (03PS11) 10Dzahn: gerrit: add fallback default undef for IPv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/397865 (https://phabricator.wikimedia.org/T182705) (owner: 10Paladox) [19:56:58] (03CR) 10Dzahn: [C: 032] gerrit: add fallback default undef for IPv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/397865 (https://phabricator.wikimedia.org/T182705) (owner: 10Paladox) [19:57:04] thanks :) [19:57:07] !log Updated Parsoid to 741fc5d (T114072, T181226, T21910, T152540, T103714, T97093, T118520, T181229, T182338, T182170, T169006) [19:57:16] Sorry for the noise ^ [19:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:28] T181226: Don't output section wrappers in body_only mode - https://phabricator.wikimedia.org/T181226 [19:57:29] T182338: Parsoid: Interwiki links with angle brackets should be invalid - https://phabricator.wikimedia.org/T182338 [19:57:29] T97093: Parsoid should mark interwiki links as such - https://phabricator.wikimedia.org/T97093 [19:57:29] T152540: Migrate to HTML5 section ids - https://phabricator.wikimedia.org/T152540 [19:57:29] T21910: Headings of the form ===+\s+ are not preprocessed correctly. - https://phabricator.wikimedia.org/T21910 [19:57:29] T181229: Content after reference tag in template disappears - https://phabricator.wikimedia.org/T181229 [19:57:29] T182170: Create new high-priority linter category for multiple unclosed formatting tags whose effects accumulate - https://phabricator.wikimedia.org/T182170 [19:57:30] T114072:
<section> tags for MediaWiki sections - https://phabricator.wikimedia.org/T114072 [19:57:30] T103714: Fragment encoding in heading anchors and in links differ - https://phabricator.wikimedia.org/T103714 [19:57:31] T118520: Use <figure-inline> instead of <span> for inline figures. - https://phabricator.wikimedia.org/T118520 [19:57:31] T169006: Correctly redirect in Parsoid /transform/wikitext/to/lint endpoint - https://phabricator.wikimedia.org/T169006 [20:00:04] no_justification: #bothumor I � Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171212T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:01:25] 10Operations, 10Gerrit, 10Patch-For-Review: gerrit's ipv6 param is failing on labs - https://phabricator.wikimedia.org/T182705#3832298 (10Paladox) 05Open>03Resolved [20:01:27] (03PS3) 10Dzahn: pentest::tools: add missing system::role [puppet] - 10https://gerrit.wikimedia.org/r/397731 [20:01:58] (03PS4) 10Dzahn: pentest::tools: add missing system::role [puppet] - 10https://gerrit.wikimedia.org/r/397731 [20:02:16] (03CR) 10Dzahn: [C: 032] pentest::tools: add missing system::role [puppet] - 10https://gerrit.wikimedia.org/r/397731 (owner: 10Dzahn) [20:04:41] (03PS2) 10Dzahn: mirrors: move standard include out of site [puppet] - 10https://gerrit.wikimedia.org/r/397728 [20:04:54] (03PS3) 10Dzahn: mirrors: move standard include out of site [puppet] - 10https://gerrit.wikimedia.org/r/397728 [20:05:01] (03CR) 10EddieGP: [C: 04-1] "Google for "techblog.wikimedia.org", find the following examples:" [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [20:05:52] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [20:06:12] (03CR) 10Dzahn: [C: 032] mirrors: move standard include out of site [puppet] - 10https://gerrit.wikimedia.org/r/397728 (owner: 10Dzahn) [20:06:27] (03CR) 10EddieGP: [C: 04-1]
"> - http://techblog.wikimedia.org/2009/06/wikimediamobile-launch/" [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [20:06:42] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 11.322 second response time [20:07:39] ^ alert for tools home page possibly a delayed reaction? [20:07:57] (03PS4) 10Dzahn: site: convert "not true"-spare systems to role(test) [puppet] - 10https://gerrit.wikimedia.org/r/394731 [20:08:01] it seems very slow but does come up for me atm [20:09:04] andrewbogott: madhuvishy fyi tools ome page alert above but it did recover so idk [20:09:13] (03PS5) 10Dzahn: site/cp: convert "not true"-spare systems to role(test) [puppet] - 10https://gerrit.wikimedia.org/r/394731 [20:09:20] (03CR) 10Dzahn: [C: 032] site/cp: convert "not true"-spare systems to role(test) [puppet] - 10https://gerrit.wikimedia.org/r/394731 (owner: 10Dzahn) [20:11:08] chasemp: hmmm I see complaints of redis connection failures [20:11:17] seems similar to the stashbot issues [20:11:33] (03PS6) 10Dzahn: site/cp: convert "not true"-spare systems to role(test) [puppet] - 10https://gerrit.wikimedia.org/r/394731 [20:11:50] (03PS7) 10Dzahn: site/cp: convert "not true"-spare systems to role(test) [puppet] - 10https://gerrit.wikimedia.org/r/394731 [20:11:56] but seems fine now [20:13:32] (03PS6) 10Dzahn: contint: remove browsertests role from permanent slaves [puppet] - 10https://gerrit.wikimedia.org/r/397601 (https://phabricator.wikimedia.org/T182642) (owner: 10Paladox) [20:14:38] (03CR) 10Dzahn: [C: 032] contint: remove browsertests role from permanent slaves [puppet] - 10https://gerrit.wikimedia.org/r/397601 (https://phabricator.wikimedia.org/T182642) (owner: 10Paladox) [20:15:23] (03CR) 10Dzahn: "why not?" 
[puppet] - 10https://gerrit.wikimedia.org/r/397720 (owner: 10Paladox) [20:16:38] (03PS1) 10Cmjohnson: Removing site.pp and dhcpd file entries for mc1001-18 T164341 [puppet] - 10https://gerrit.wikimedia.org/r/397906 [20:16:41] (03PS2) 10Dzahn: logstash: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/392996 [20:17:17] (03CR) 10Rush: "This is probably as complex as we usually get with bash typically. I believe the standard here is 4 spaces and not to use tabs fyi." [puppet] - 10https://gerrit.wikimedia.org/r/394572 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [20:17:55] madhuvishy: one thought is maybe we suddenly got conntrack problems there now? [20:18:03] something to look at if this flaps in this way [20:18:32] (03PS2) 10Cmjohnson: Removing site.pp and dhcpd file entries for mc1001-18 T164341 [puppet] - 10https://gerrit.wikimedia.org/r/397906 [20:19:03] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9309/" [puppet] - 10https://gerrit.wikimedia.org/r/392996 (owner: 10Dzahn) [20:20:43] (03CR) 10Paladox: "> why not?" 
[puppet] - 10https://gerrit.wikimedia.org/r/397720 (owner: 10Paladox) [20:20:51] (03PS2) 10Dzahn: grafana: add dashboard for cloud-codfw [puppet] - 10https://gerrit.wikimedia.org/r/393698 [20:31:42] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@2690678]: Update mobileapps to 5b8796d (duration: 40m 04s) [20:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:46] !log mholloway-shell@tin Started deploy [mobileapps/deploy@0a9d635]: Update mobileapps to 035608d [20:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:18] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@0a9d635]: Update mobileapps to 035608d (duration: 02m 32s) [20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:52] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 502 (expecting: 200) [20:37:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 173 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:38:02] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [20:38:23] (03CR) 10Dzahn: "before we just do this code change let's find out if this means we need a ticket to get the npm package in stretch or if we stop using it " [puppet] - 10https://gerrit.wikimedia.org/r/397720 (owner: 10Paladox) [20:39:42] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 0.014 second response time [20:39:58] !log mholloway-shell@tin Started deploy [mobileapps/deploy@b2d5b8e]: Update mobileapps to 172abc7 [20:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:40:55] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [20:41:50] (03CR) 10Chad: [C: 032] group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397873 (owner: 10Chad) [20:42:02] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [20:42:32] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:42:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:43:02] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:43:03] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [20:43:03] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [20:43:03] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:43:29] (03Merged) 10jenkins-bot: group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397873 (owner: 10Chad) [20:43:32] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:43:42] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:43:45] (03CR) 10jenkins-bot: group0 to wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397873 (owner: 10Chad) [20:43:50] (03PS8) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [20:43:52] (03PS1) 10Andrew Bogott: labs-bootstrapvz: include backports in our initial apt sources [puppet] - 10https://gerrit.wikimedia.org/r/397911 [20:44:32] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [20:44:32] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:44:46] !log mholloway-shell@tin Finished deploy 
[mobileapps/deploy@b2d5b8e]: Update mobileapps to 172abc7 (duration: 04m 48s) [20:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [20:45:50] (03PS2) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [20:45:56] (03CR) 10Andrew Bogott: [C: 032] labs-bootstrapvz: include backports in our initial apt sources [puppet] - 10https://gerrit.wikimedia.org/r/397911 (owner: 10Andrew Bogott) [20:46:24] (03CR) 10jerkins-bot: [V: 04-1] Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [20:48:11] !log ppchelko@tin Started deploy [restbase/deploy@506047c]: Update expected Parsoid version, switched summary to MCS [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] 10Operations: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3832373 (10MoritzMuehlenhoff) This is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=883938 Proposed fix at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=883938#170 [20:50:23] (03PS3) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [20:50:52] (03CR) 10jerkins-bot: [V: 04-1] Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [20:51:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage 
responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:51:22] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:51:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 responds with malformed body (IndexError: list index out of range) [20:51:33] (03PS1) 10Anomie: Fix 'sql' script for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/397912 (https://phabricator.wikimedia.org/T182713) [20:51:33] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:51:33] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:51:35] (03PS1) 10Anomie: Add --replica parameter to sql script [puppet] - 10https://gerrit.wikimedia.org/r/397913 [20:51:40] !log ppchelko@tin Finished deploy [restbase/deploy@506047c]: Update expected Parsoid version, switched summary to MCS (duration: 03m 30s) [20:51:43] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:12] PROBLEM - 
restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:52:13] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:52:13] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:52:13] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CR [20:52:13] ve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:52:23] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:52:23] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:52:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:52:33] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 responds with malformed body (IndexError: list index out of range) [20:52:33] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 responds with malformed body (IndexError: list index out of range) [20:52:48] marostegui, jynus: FYI, I found something that broke with the multi-instance hosts. I also made a patch to fix it: https://gerrit.wikimedia.org/r/#/c/397912/ [20:53:10] !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.12 [20:53:12] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 responds with malformed body (IndexError: list index out of range) [20:53:13] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries for January 15 responds with malformed body (IndexError: list index out of range) [20:53:13] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 responds with malformed body (IndexError: list index out of range) [20:53:13] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from 
storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:53:13] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:53:13] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:53:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:53:14] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [20:53:14] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:53:15] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:53:15] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: 
/originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [20:53:16] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [20:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:48] !log ppchelko@tin Started deploy [restbase/deploy@d3ca789]: Revert deployment for using MCS for summaries [20:53:56] known ^ [20:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:24] !log ppchelko@tin Finished deploy [restbase/deploy@d3ca789]: Revert deployment for using MCS for summaries (duration: 01m 36s) [20:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:52] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 39 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:00:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:00:58] (03PS4) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [21:01:19] (03CR) 10jerkins-bot: [V: 04-1] Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [21:02:32] !log ppchelko@tin Started deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries [21:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:35] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection 
refused [21:04:26] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [21:04:26] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:04:26] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [21:04:26] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:04:26] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [21:04:26] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:04:55] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /originalimage/source = http://upload.wikimedia.org/wikipedia/commons/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICA [21:04:56] lected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:05:16] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused [21:05:26] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [21:05:26] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [21:05:26] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [21:05:26] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [21:05:26] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [21:05:26] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:05:35] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [21:05:35] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [21:05:55] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [21:05:55] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [21:06:06] known ^ [21:06:16] PROBLEM - DPKG on labtestcontrol2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:06:26] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [21:06:26] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [21:06:46] RECOVERY - 
Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [21:06:53] (03PS3) 10Zoranzoki21: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [21:07:16] RECOVERY - DPKG on labtestcontrol2003 is OK: All packages OK [21:07:55] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 37 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:08:23] !log ppchelko@tin Finished deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries (duration: 05m 51s) [21:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:36] !log ppchelko@tin Started deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 2 after failed content rerender [21:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:28] (03PS9) 10Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [21:09:30] (03PS1) 10Andrew Bogott: bootstrapvz: s/jessie/stretch in apt sources [puppet] - 10https://gerrit.wikimedia.org/r/397920 [21:12:01] !log mobrovac@tin Started restart [electron-render/deploy@94d27d7]: Electron hanging - T174916 [21:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:12] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [21:12:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:13:39] !log ppchelko@tin Finished deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 2 after failed content rerender (duration: 05m 05s) [21:13:48] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:51] !log ppchelko@tin Started deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 3 [21:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:52] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3832458 (10RobH) a:03Nuria This seems stalled seeking endorsement from @nuria. As such, I've assigned this directly to the user. Please p... [21:15:55] !log ppchelko@tin Finished deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 3 (duration: 02m 04s) [21:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:51] !log ppchelko@tin Started deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 4 [21:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:36] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:20:36] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:20:45] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:20:51] (03CR) 
10Ottomata: "JENKINS YOU LIE!" [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [21:21:06] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:21:07] I guess that's from the deploy? [21:21:14] (03PS2) 10Dzahn: Releases jenkins: Only clone MediaWiki core, and make it a bare clone [puppet] - 10https://gerrit.wikimedia.org/r/397891 (owner: 10Chad) [21:21:36] !log ppchelko@tin Finished deploy [restbase/deploy@dceab2e]: Bump expected parsoid version, but do not switch summaries, take 4 (duration: 02m 45s) [21:21:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:21:36] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:21:36] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:05] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the 
events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:06] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:06] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:06] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:36] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:37] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:37] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:37] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} 
(Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:37] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:46] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:22:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:23:03] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: s/jessie/stretch in apt sources [puppet] - 10https://gerrit.wikimedia.org/r/397920 (owner: 10Andrew Bogott) [21:23:17] ^^^ we're on it [21:23:44] k [21:23:51] !log mobrovac@tin Started deploy [restbase/deploy@dceab2e]: Switch to Parsoid content v1.6.0 and switch to Cassandra 3 storage - T179417 [21:23:52] !log mholloway-shell@tin Started deploy [mobileapps/deploy@b2d5b8e]: Update mobileapps to 172abc7 [21:23:56] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:24:01] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3832487 (10RobH) [21:24:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:04] T179417: Migrate Parsoid from legacy to new storage - https://phabricator.wikimedia.org/T179417 [21:24:04] 10Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#3832488 (10RobH) [21:24:05] 10Operations, 10ops-ulsfo: cp4008 and cp4012 running on single PSU - https://phabricator.wikimedia.org/T151275#3832485 (10RobH) 05stalled>03declined We've replaced these systems. [21:24:06] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:24:06] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:22] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@b2d5b8e]: Update mobileapps to 172abc7 (duration: 00m 30s) [21:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:35] !log mholloway-shell@tin Started deploy [mobileapps/deploy@28bfda3]: Update mobileapps to d0ee651 [21:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:43] (03CR) 10Dzahn: [C: 032] Releases jenkins: Only clone MediaWiki core, and make it a bare clone [puppet] - 10https://gerrit.wikimedia.org/r/397891 (owner: 10Chad) [21:24:44] !log demon@tin Synchronized php-1.31.0-wmf.11/extensions/GlobalBlocking/: (no justification provided) (duration: 01m 08s) [21:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:54] (03PS3) 10Dzahn: 
Releases jenkins: Only clone MediaWiki core, and make it a bare clone [puppet] - 10https://gerrit.wikimedia.org/r/397891 (owner: 10Chad) [21:25:31] no_justification: I see what you did there [21:25:45] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:26:19] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@28bfda3]: Update mobileapps to d0ee651 (duration: 01m 45s) [21:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:36] (03PS2) 10Dzahn: Releases: Also include release tools for releasing [puppet] - 10https://gerrit.wikimedia.org/r/397895 (owner: 10Chad) [21:26:45] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:27:21] !log demon@tin Synchronized php-1.31.0-wmf.12/extensions/GlobalBlocking/: (no justification provided) (duration: 01m 07s) [21:27:29] (03CR) 10Dzahn: [C: 032] Releases: Also include release tools for releasing [puppet] - 10https://gerrit.wikimedia.org/r/397895 (owner: 10Chad) [21:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:45] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:27:45] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: 
Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:27:45] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:27:55] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:28:07] !log mobrovac@tin Finished deploy [restbase/deploy@dceab2e]: Switch to Parsoid content v1.6.0 and switch to Cassandra 3 storage - T179417 (duration: 04m 16s) [21:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:48] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:28:48] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:28:48] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:29:05] (03CR) 10Dzahn: "FAIL. 
Error: /usr/bin/git clone https://gerrit.wikimedia.org/r/mediawiki/core --bare /srv/mediawiki/core returned 128 instead of one of [" [puppet] - 10https://gerrit.wikimedia.org/r/397891 (owner: 10Chad) [21:31:08] PROBLEM - restbase endpoints health on cerium is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:09] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:09] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:48] PROBLEM - restbase endpoints health on xenon is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:48] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:57] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body (IndexError: list index out of range) [21:31:59] !log 
releases1001 - rm mediawiki core repo and let puppet try to recreate it (follow-up issue after gerrit:397891) [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:42] rb alerts known, acked ^ [21:34:08] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_mediawiki/core] [21:34:27] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_mediawiki/core] [21:35:37] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:38:37] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 18.864 second response time [21:39:08] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:42:07] (03CR) 10Dzahn: "manually deleted all the mediawiki repos and let puppet recreate the core repo" [puppet] - 10https://gerrit.wikimedia.org/r/397891 (owner: 10Chad) [21:46:37] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:47:18] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 0.018 second response time [21:55:20] (03PS2) 10Dzahn: planet: add some missing Hiera calls and rename params [puppet] - 10https://gerrit.wikimedia.org/r/397729 [21:55:49] (03CR) 10jerkins-bot: [V: 04-1] planet: add some missing Hiera calls and rename params [puppet] - 10https://gerrit.wikimedia.org/r/397729 (owner: 10Dzahn) [22:02:01] !log setting compaction throughput to 5 MB/s, restbase1010 [22:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:37] 
Operations, Puppet, cloud-services-team, Puppet-infrastructure-modernization: Stop using etckeeper (at least before/after puppet runs) - https://phabricator.wikimedia.org/T182721#3832637 (Andrew) [22:10:08] (PS3) Dzahn: planet: add some missing Hiera calls and rename params [puppet] - https://gerrit.wikimedia.org/r/397729 [22:11:07] (PS10) Andrew Bogott: WMCS: set puppet_major_version to 4 [puppet] - https://gerrit.wikimedia.org/r/397711 (https://phabricator.wikimedia.org/T178717) [22:11:09] (PS1) Andrew Bogott: bootstrapvz: allow default (v4) puppet packages on stretch base images. [puppet] - https://gerrit.wikimedia.org/r/397966 (https://phabricator.wikimedia.org/T178717) [22:12:35] (PS1) Andrew Bogott: puppet agent: don't call etckeeper hooks pre- and post-run [puppet] - https://gerrit.wikimedia.org/r/397967 (https://phabricator.wikimedia.org/T182721) [22:15:24] (CR) Dzahn: aptrepo: move Hiera calls into parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/397730 (owner: Dzahn) [22:16:39] Operations, ops-eqiad, Tool-Global-user-contributions: Database error: Unable to connect to s1.web.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T182722#3832649 (Jeff_G) [22:20:35] (CR) Alexandros Kosiaris: [C: -1] "Indeed. looks like this class is only used from modules/role/manifests/aptrepo/wikimedia.pp so adding a profile shouldn't be that difficul" [puppet] - https://gerrit.wikimedia.org/r/397730 (owner: Dzahn) [22:22:44] (CR) Alexandros Kosiaris: [C: 1] "Yeah I see no reason for us to have those. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/397967 (https://phabricator.wikimedia.org/T182721) (owner: 10Andrew Bogott) [22:24:05] (03PS1) 10EBernhardson: Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 [22:24:54] !log mholloway-shell@tin Started deploy [mobileapps/deploy@ea8f05d]: Update mobileapps to 94f267b [22:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:28] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [22:26:28] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [22:27:28] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [22:27:28] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [22:27:28] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [22:27:28] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [22:27:39] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:27:39] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [22:27:39] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [22:27:39] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [22:28:28] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [22:28:32] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [22:28:32] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:28:32] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [22:28:32] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [22:28:32] RECOVERY - restbase endpoints health on restbase1016 is 
OK: All endpoints are healthy [22:28:32] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [22:28:34] (03PS3) 10EBernhardson: Setup MLR AB test for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397582 [22:28:38] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [22:28:40] (03PS2) 10EBernhardson: Turn on MLR for most wikis with >1% of search traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/397970 [22:28:47] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:28:47] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [22:29:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [22:29:37] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [22:29:37] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [22:29:47] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [22:29:48] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [22:30:03] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@ea8f05d]: Update mobileapps to 94f267b (duration: 05m 10s) [22:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:37] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [22:31:47] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [22:31:47] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [22:32:47] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [22:32:47] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [22:33:18] (03PS4) 10EBernhardson: Setup MLR AB test for hewiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/397582 (https://phabricator.wikimedia.org/T182616) [22:46:46] 10Operations, 10ops-eqiad, 10Tool-Global-user-contributions: Database error: Unable to connect to s1.web.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T182722#3832708 (10Magnus) p:05Triage>03Unbreak! Several of my tools appear to be affected as well, example: ``` ERROR:php_network_getaddresse... [22:55:56] (03PS2) 10Dzahn: prometheus: move duplicate firewall/standard include [puppet] - 10https://gerrit.wikimedia.org/r/397727 [22:57:03] (03PS3) 10Dzahn: prometheus: move duplicate include, use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/397727 [22:59:11] (03CR) 10Dzahn: "ok, let's base access on role::logging::mediawiki::udp2log then. and/or make a new clean "role::logging::server" that is the only role on " [puppet] - 10https://gerrit.wikimedia.org/r/393994 (owner: 10Dzahn) [23:07:48] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia outage in progress. Telia Carrier Reference: 00807925 [23:07:49] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia outage in progress. 
Telia Carrier Reference: 00807925 [23:09:35] (03PS3) 10Dzahn: mwlog/xenon: access should be based on role, not host names [puppet] - 10https://gerrit.wikimedia.org/r/393994 [23:15:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [23:15:48] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [23:18:17] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [23:20:07] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 10.857 second response time [23:22:47] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [23:22:49] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [23:22:57] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [23:28:57] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [23:28:58] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [23:29:37] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [23:29:47] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [23:29:48] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [23:29:48] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [23:47:37] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.238 second response time [23:54:38] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: 
HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.143 second response time [23:59:19] Jhs: around? going to start swat