[00:00:25] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Allow protocol-relative URLs in TemplateStyles (duration: 00m 59s) [00:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:26] 10Operations, 10Performance-Team, 10Wikimedia-Apache-configuration: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063305 (10Krinkle) Further testing shows that while it matches the custom default VirtualHost on mwdebug1001 and mwdebug... [00:07:32] Reedy: thanks, works [00:07:37] cool [00:08:01] (03CR) 10Krinkle: "Should the default WMF-specific values be removed from TemplateStyles/extension.json? It seems odd to have both. Overall it seems preferre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [00:17:27] (03CR) 10Gergő Tisza: "Those seem like reasonable defaults to me (much like how Commons is also the default configuration for foreign image repositories). We don" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [00:20:04] (03CR) 10Krinkle: "Right, but the TemplateStyles repo's default config only specifies upload.wikimedia.org, which is HTTPS-only. So that's safe to move there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [00:27:32] (03CR) 10Gergő Tisza: "I meant the referring site, but I guess when your whole page is over HTTP, that's not really something to worry about..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [00:28:25] 10Operations, 10ops-codfw, 10netops, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4063349 (10Papaul) p:05Triage>03Normal [00:30:28] 10Operations, 10Performance-Team, 10Wikimedia-Apache-configuration: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063366 (10Krinkle) [00:36:43] 10Operations, 10ops-requests: Upgrade former svn account nojhan for wikitech. - https://phabricator.wikimedia.org/T83042#4063375 (10Dzahn) [00:37:04] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [00:41:00] (03Abandoned) 10Dzahn: decom californium from site and install_server [puppet] - 10https://gerrit.wikimedia.org/r/420147 (https://phabricator.wikimedia.org/T189921) (owner: 10Dzahn) [00:41:11] (03Abandoned) 10Dzahn: remove californium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/420145 (https://phabricator.wikimedia.org/T189921) (owner: 10Dzahn) [00:41:37] (03PS1) 10BryanDavis: toolforge: Add Content-Security-Policy-Report-Only header [puppet] - 10https://gerrit.wikimedia.org/r/420619 (https://phabricator.wikimedia.org/T130748) [00:41:42] (03Abandoned) 10Dzahn: site: remove mapped IPv6 from californium [puppet] - 10https://gerrit.wikimedia.org/r/420144 (https://phabricator.wikimedia.org/T189921) (owner: 10Dzahn) [00:47:58] (03PS1) 10Dzahn: site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) [00:48:23] (03PS2) 10Dzahn: site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) [00:49:03] (03CR) 10jerkins-bot: [V: 04-1] site: unify bast1001/1002 stanza, add IPv6 for 
bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [00:49:07] (03CR) 10Dzahn: [C: 032] site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [00:50:26] 10Operations, 10Cloud-Services, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4063407 (10Dzahn) >>! In T189921#4058930, @TerraCodes wrote: > @Dzahn why remove all the subscribers? Because when you click "create subtask" on a parent task... [00:52:38] (03PS3) 10Dzahn: site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) [00:55:53] (03CR) 10Dzahn: [C: 032] site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [00:56:14] (03PS4) 10Dzahn: site: unify bast1001/1002 stanza, add IPv6 for bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420620 (https://phabricator.wikimedia.org/T183412) [01:00:55] 10Operations, 10Cloud-Services: Create Wikitech account for Aryeh Gregor (preexisting SVN login "simetrical") - https://phabricator.wikimedia.org/T189981#4059265 (10tstarling) I took a stab at doing this myself. I added an email address and password to the LDAP user using the tools described at https://wikitec... 
[01:01:58] (03PS1) 10Dzahn: network::constants: add bast1002 as bastion server [puppet] - 10https://gerrit.wikimedia.org/r/420621 (https://phabricator.wikimedia.org/T183412) [01:05:39] (03PS1) 10Dzahn: fix IPv6 record for bast1002, wrong row [dns] - 10https://gerrit.wikimedia.org/r/420623 (https://phabricator.wikimedia.org/T183412) [01:06:10] (03PS2) 10Dzahn: fix IPv6 record for bast1002, wrong row [dns] - 10https://gerrit.wikimedia.org/r/420623 (https://phabricator.wikimedia.org/T183412) [01:06:34] (03CR) 10Dzahn: [C: 032] fix IPv6 record for bast1002, wrong row [dns] - 10https://gerrit.wikimedia.org/r/420623 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [01:07:45] (03CR) 10Dzahn: [C: 032] "6.8.0.0.4.5.1.0.0.8.0.0.8.0.2.0.3.0.0.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer bast1002.wikimedia.org." [dns] - 10https://gerrit.wikimedia.org/r/420623 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [01:08:35] (03PS2) 10Dzahn: network::constants: add bast1002 as bastion server [puppet] - 10https://gerrit.wikimedia.org/r/420621 (https://phabricator.wikimedia.org/T183412) [01:11:20] (03PS1) 10Dzahn: sync /home dirs from bast1001 to bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420626 (https://phabricator.wikimedia.org/T183412) [01:20:42] (03PS3) 10Madhuvishy: slow-parse: Turn off rsync from mwlog1001 to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) [01:24:20] (03CR) 10Madhuvishy: [C: 032] slow-parse: Turn off rsync from mwlog1001 to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [01:28:41] (03PS3) 10Madhuvishy: slow-parse: Remove code for rsync to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420410 (https://phabricator.wikimedia.org/T189284) [01:30:08] (03CR) 10Madhuvishy: [C: 032] slow-parse: Remove code for rsync to dumps servers [puppet] - 
10https://gerrit.wikimedia.org/r/420410 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [01:39:28] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063485 (10Volker_E) [01:41:17] (03PS2) 10Madhuvishy: dumps: Absent slowparse logs rsync config [puppet] - 10https://gerrit.wikimedia.org/r/420411 (https://phabricator.wikimedia.org/T189284) [01:49:23] (03CR) 10Madhuvishy: [C: 032] dumps: Absent slowparse logs rsync config [puppet] - 10https://gerrit.wikimedia.org/r/420411 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [01:52:53] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#4063505 (10Jdlrobson) [01:54:55] (03PS2) 10Madhuvishy: dumps: Remove slowparse rsync related code [puppet] - 10https://gerrit.wikimedia.org/r/420415 (https://phabricator.wikimedia.org/T189284) [01:57:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [01:59:46] (03CR) 10Madhuvishy: [C: 032] dumps: Remove slowparse rsync related code [puppet] - 10https://gerrit.wikimedia.org/r/420415 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [02:02:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:09:27] (03PS3) 10Madhuvishy: dumps: Enable fetcher for labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/416984 (https://phabricator.wikimedia.org/T188727) [02:13:05] 10Operations, 10Cloud-Services, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4063564 (10TerraCodes) >>! In T189921#4063407, Dzahn wrote: >>>! In T189921#4058930, @TerraCodes wrote: >> Dzahn why remove all the subscribers? > > Because wh... 
[02:13:24] (03CR) 10Madhuvishy: [C: 032] dumps: Enable fetcher for labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/416984 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [02:27:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:32:02] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.25) (duration: 06m 40s) [02:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:47:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [02:48:04] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [03:02:05] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [03:08:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [03:12:04] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [03:14:20] (03CR) 10Anomie: "> Should the default WMF-specific values be removed from" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [03:26:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 843.89 seconds [03:26:51] (03CR) 10Gergő Tisza: "IMO as long as we can't provide less confusing error messages this is the least bad solution. 
It would be nice to get rid of it eventually" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [03:52:04] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [03:53:05] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [03:56:21] !log Deleting stale webpagetest.* metrics on graphite1001 and graphite2001 (any wsp file last modified 600+ days ago) – T179622 [03:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:28] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [03:59:48] 10Operations, 10Traffic, 10Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4063721 (10BBlack) Some notes from digging around in related things: * Varnish docs claim that duplicate probes (e.g. due to vcl reloads) are coalesced into a sin... 
[04:02:06] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [04:03:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [04:04:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 168.43 seconds [04:13:16] (03CR) 10Krinkle: [C: 031] Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 (owner: 10Gergő Tisza) [04:13:22] (03PS3) 10Krinkle: Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 (owner: 10Gergő Tisza) [04:13:35] (03CR) 10Krinkle: [C: 032] Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 (owner: 10Gergő Tisza) [04:14:47] (03Merged) 10jenkins-bot: Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 (owner: 10Gergő Tisza) [04:15:12] extended the downtime for the link above 80% [04:17:48] !log krinkle@tin Synchronized wmf-config/throttle.php: (no justification provided) (duration: 00m 58s) [04:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:53] !log krinkle@tin Synchronized wmf-config/throttle-analyze.php: (no justification provided) (duration: 00m 58s) [04:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:46] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063740 (10Bawolff) [05:15:14] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063758 (10Prtksxna) Thanks @Dzahn {icon smile-o} [06:18:12] !log Deploy schema change on s4 primary master db1068 - T187089 T185128 T153182 [06:18:19]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:20] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:18:20] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:18:20] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:28:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt] [06:31:16] (03CR) 10Marostegui: [C: 031] "Yeah, it will probably break hearbeat unless we do the ln" [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [06:32:19] (03CR) 10Marostegui: [C: 031] Add advisorswiki to $private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [06:35:25] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4063852 (10Marostegui) db1106 is now catching up after being recloned from db1065. Once it has been replicating for another 24h, I would say we can cha... 
[06:37:35] (03PS5) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) [06:39:58] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10523/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [06:43:02] (03PS3) 10Elukey: Refactor stat1005's roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) [06:44:26] (03CR) 10Elukey: [C: 032] Refactor stat1005's roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [06:51:30] (03PS1) 10Elukey: role::statistics::private: enable alarm for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420639 (https://phabricator.wikimedia.org/T187073) [06:55:04] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10524/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420639 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [06:58:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:09:20] (03PS4) 10Giuseppe Lavagetto: hhvm: remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 [07:38:10] (03PS1) 10Marostegui: db-eqiad.php: Pool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420643 (https://phabricator.wikimedia.org/T183469) [07:39:05] (03PS1) 10Marostegui: db1106.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/420644 (https://phabricator.wikimedia.org/T183469) [07:39:43] (03CR) 10Marostegui: [C: 032] db1106.yaml: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/420644 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:42:16] (03PS2) 
10Marostegui: db-eqiad.php: Pool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420643 (https://phabricator.wikimedia.org/T183469) [07:43:59] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:46:00] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:46:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420643 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:47:40] (03Merged) 10jenkins-bot: db-eqiad.php: Pool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420643 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:49:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1106 in s1 - T183469 (duration: 00m 58s) [07:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:15] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [07:56:32] (03CR) 10Jcrespo: [C: 031] "Let's take the opportunity to upgrade labsdb1009 and sanitarium hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [07:56:42] (03PS5) 10Giuseppe Lavagetto: hhvm: remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 [08:02:01] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420645 [08:03:43] (03PS1) 10Elukey: Refactor the last bits of the Analytics code not following role/profile [puppet] - 10https://gerrit.wikimedia.org/r/420646 (https://phabricator.wikimedia.org/T167790) [08:05:22] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420645 (owner: 10Marostegui) [08:05:45] (03CR) 10jenkins-bot: Log ReadingLists warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420155 (https://phabricator.wikimedia.org/T189340) (owner: 10Gergő Tisza) [08:05:49] (03CR) 10jenkins-bot: Enable Wikidata description override on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [08:05:54] (03CR) 10jenkins-bot: Allow protocol-relative URLs in TemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [08:06:00] (03CR) 10jenkins-bot: Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 (owner: 10Gergő Tisza) [08:06:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420645 (owner: 10Marostegui) [08:06:06] (03CR) 10jenkins-bot: db-eqiad.php: Pool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420643 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [08:07:20] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420645 (owner: 10Marostegui) [08:08:45] !log marostegui@tin 
Synchronized wmf-config/db-eqiad.php: Depool pc1004 for kernel, mariadb and socket location upgrade (duration: 00m 57s) [08:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:20] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:12:29] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:12:40] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420645 (owner: 10Marostegui) [08:13:24] (03PS2) 10Muehlenhoff: mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/380712 [08:14:08] (03PS1) 10Jcrespo: mariadb: Move parsercaches socket location to the default one [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) [08:15:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:16:24] criticals [08:16:35] since 7:20 [08:16:54] I am going to guess traffic, ema [08:17:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:19:16] looking [08:22:18] a few fetch failures in text_esams, nothing too crazy so far [08:22:19] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [08:23:03] but it seems to be going up since 5h [08:23:27] !log Stop 
MySQL on pc1004 for mariadb, kernel and socket location upgrade [08:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:43] (03PS1) 10Jcrespo: dbproxy: Depool labsdb1009 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/420648 (https://phabricator.wikimedia.org/T189181) [08:24:13] (03CR) 10Vgutierrez: [C: 031] "nitpick as inline comments, otherwise LGTM" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [08:24:29] (03CR) 10Marostegui: [C: 031] dbproxy: Depool labsdb1009 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/420648 (https://phabricator.wikimedia.org/T189181) (owner: 10Jcrespo) [08:24:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:24:56] (03PS2) 10Jcrespo: dbproxy: Depool labsdb1009 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/420648 (https://phabricator.wikimedia.org/T189181) [08:25:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:25:36] !log installing curl security updates [08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:27] (03CR) 10Jcrespo: [C: 032] dbproxy: Depool labsdb1009 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/420648 (https://phabricator.wikimedia.org/T189181) (owner: 10Jcrespo) [08:28:03] (03CR) 10Vgutierrez: [C: 031] "let's merge this, as RobH pointed out, this doesn't need meeting approval." 
[puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [08:28:33] (03PS2) 10Elukey: Refactor the last bits of the Analytics code not following role/profile [puppet] - 10https://gerrit.wikimedia.org/r/420646 (https://phabricator.wikimedia.org/T167790) [08:32:17] (03CR) 10Elukey: [C: 031] mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/380712 (owner: 10Muehlenhoff) [08:32:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420649 [08:34:30] !log depool labsdb1009 [08:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:03] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420649 (owner: 10Marostegui) [08:36:54] (03PS3) 10Elukey: Refactor the last bits of the Analytics code not following role/profile [puppet] - 10https://gerrit.wikimedia.org/r/420646 (https://phabricator.wikimedia.org/T167790) [08:38:09] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420649 (owner: 10Marostegui) [08:38:56] (03PS1) 10Jcrespo: Revert "dbproxy: Depool labsdb1009 for upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/420650 [08:39:06] (03PS1) 10Marostegui: db-eqiad.php: Repool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420651 [08:39:36] (03CR) 10Giuseppe Lavagetto: "compiler results: https://puppet-compiler.wmflabs.org/compiler02/" [puppet] - 10https://gerrit.wikimedia.org/r/415829 (owner: 10Giuseppe Lavagetto) [08:41:34] !log upgrade and restart labsdb1009 [08:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420651 (owner: 10Marostegui) [08:42:21] (03PS6) 10Giuseppe 
Lavagetto: hhvm: remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 [08:43:13] (03Merged) 10jenkins-bot: db-eqiad.php: Repool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420651 (owner: 10Marostegui) [08:43:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420651 (owner: 10Marostegui) [08:43:47] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 (owner: 10Giuseppe Lavagetto) [08:44:11] (03PS1) 10Muehlenhoff: Fix up Cumin db aliases [puppet] - 10https://gerrit.wikimedia.org/r/420653 [08:44:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool pc1004 after kernel, mariadb and socket location upgrade (duration: 00m 58s) [08:44:39] (03PS1) 10Marostegui: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420654 [08:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:02] (03PS1) 10Marostegui: es1012.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/420655 [08:47:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420654 (owner: 10Marostegui) [08:47:46] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:48:00] <_joe_> that's expected to happen to some servers ^^ [08:48:18] (03PS2) 10Marostegui: es1012.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/420655 [08:48:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420654 (owner: 10Marostegui) [08:48:46] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [08:48:56] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:49:02] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#4063989 (10MoritzMuehlenhoff) a:05chasemp>03RobH Reassigning to Rob since I guess some meta data will need to be updated in Racktables for repurposing the... [08:49:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420654 (owner: 10Marostegui) [08:49:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1012 for kernel, mariadb and socket location upgrade (duration: 00m 58s) [08:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:56] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:50:02] !log Stop MySQL on es1012 for mariadb, kernel and socket location upgrade [08:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:36] PROBLEM - HHVM jobrunner on mw1334 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:50:38] Git::Clone[operations/mediawiki-config] failed on puppet, is gerrit ok? [08:51:04] (03CR) 10Marostegui: [C: 032] es1012.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/420655 (owner: 10Marostegui) [08:51:05] <_joe_> not sure [08:51:19] gerrit seems fine, but the clone fails [08:51:19] <_joe_> works for me [08:51:22] it is working fine for me [08:51:27] <_joe_> jynus: on which server? 
[08:51:35] labsdb1009, just restarted [08:51:36] RECOVERY - HHVM jobrunner on mw1334 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:51:44] <_joe_> jynus: uhm ok lemme check [08:51:49] Failed to call refresh: /usr/bin/git submodule update --init returned 1 instead of one of [0] [08:52:09] git is installed [08:52:13] <_joe_> jynus: some things were done on the submodules yesterday IIRC [08:52:26] does it work locally jynus? [08:52:32] <_joe_> so go try that yourself as root in the directory where it's installed [08:52:33] I mean, on your computer for example? [08:53:02] yes it does [08:53:12] I will check changes on the git class [08:53:13] <_joe_> where is the clone on that machine? [08:53:17] <_joe_> I can take a look [08:54:03] I need to check, cannot remember [08:54:08] (03PS1) 10Muehlenhoff: Add Cumin alias for thumbor canary [puppet] - 10https://gerrit.wikimedia.org/r/420657 [08:54:11] <_joe_> I can look then [08:54:14] <_joe_> don't worry [08:54:52] It should be ::role::labs::db::common [08:54:56] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.004 second response time [08:55:12] <_joe_> /usr/local/lib/mediawiki-config [08:55:15] take care of the job [08:55:19] I will take care of this [08:55:48] <_joe_> modified: portals (new commits) [08:55:48] deleted?
[08:55:57] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [08:55:59] <_joe_> the jobrunners are just restarting [08:56:00] checking mw1318 [08:56:05] ok [08:56:06] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:56:08] <_joe_> because I changed an hhvm setting [08:56:08] ah okok [08:56:17] <_joe_> I added the mysql.connect_timeout at 3 seconds [08:56:21] <_joe_> like everything else [08:57:06] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:57:20] oh, I was looking at the wrong server [08:57:29] it seems it gets confused by submodules [08:57:35] which is strange [08:58:04] I merged a change yesterday to the cdh module but didn't see any weirdness so far [08:58:08] because of the recurse_submodules => true [08:58:37] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:58:45] and I saw no errors before restart [08:59:13] Fetched in submodule path 'portals', but it did not contain 2823a787e8fbeb4255273b27ce68652d8a2ee77a. Direct fetching of that commit failed [08:59:25] <_joe_> ok that's interesting [08:59:37] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:59:39] error: no such remote ref 2823a787e8fbeb4255273b27ce68652d8a2ee77a [09:00:11] (03PS2) 10Muehlenhoff: Add Cumin alias for thumbor canary [puppet] - 10https://gerrit.wikimedia.org/r/420657 [09:00:18] <_joe_> yes, it's telling you that commit is inexistent on the remote [09:00:24] could be some kind of corruption because git was running while rebooting?
[09:00:34] but it shouldn't add commits, then [09:01:13] I can give it a manual shakedown [09:01:25] <_joe_> jynus: I just checked out the repo for portals, the commit is there [09:01:25] (03CR) 10Muehlenhoff: [C: 032] Add Cumin alias for thumbor canary [puppet] - 10https://gerrit.wikimedia.org/r/420657 (owner: 10Muehlenhoff) [09:01:31] <_joe_> so yes, something's corrupted there [09:01:32] !log restarting Jenkins for java update [09:01:44] _joe_: is it recent? [09:01:47] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:55] I can remove it and rebase [09:01:56] <_joe_> jynus: from yesterday [09:02:08] <_joe_> jynus: let me check for a sec [09:02:12] ok, waiting [09:02:47] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [09:03:47] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:04:35] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420658 [09:04:47] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:04:47] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.009 second response time [09:05:56] (03PS1) 10Muehlenhoff: Extend wdqs Cumin alias to also cover the new wdqs_internal role [puppet] - 10https://gerrit.wikimedia.org/r/420659 [09:06:20] (03PS2) 10Muehlenhoff: Extend wdqs Cumin alias to also cover the new wdqs_internal role [puppet] - 10https://gerrit.wikimedia.org/r/420659 [09:06:34] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4064012 (10Marostegui) [09:06:50] <_joe_> jynus: uhm I can't seem to find out what's wrong there [09:06:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420658 (owner: 10Marostegui) [09:07:00] _joe_: I can drop and reclone- if we know the commit number we can keep it in case it reappears [09:07:03] (03CR) 10Muehlenhoff: [C: 032] Extend wdqs Cumin alias to also cover the new wdqs_internal role [puppet] - 10https://gerrit.wikimedia.org/r/420659 (owner: 10Muehlenhoff) [09:07:31] <_joe_> jynus: something is very, very wrong there [09:07:32] because it appeared on reboot, I would say an update and reboot went wrong [09:07:40] <_joe_> and it's really strange [09:07:45] hopefully not disk-wrong [09:07:51] <_joe_> I mean I see run_command: 'git-remote-https' 'origin' 'https://gerrit.wikimedia.org/r/wikimedia/portals' [09:08:01] <_joe_> but it doesn't really do that, it seems [09:08:10] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420658 (owner: 10Marostegui) [09:08:26] (03CR) 10Alexandros Kosiaris: "Thanks. A really really long comment inline" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 (owner: 10Alexandros Kosiaris) [09:08:26] is git a module we have created? [09:08:31] or imported?
[09:08:41] (talking about puppet) [09:08:47] <_joe_> we created it [09:08:57] <_joe_> and no, it's not that puppet module that's the problem [09:09:04] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420658 (owner: 10Marostegui) [09:09:08] yeah, because it fails manually too [09:09:20] and it never failed before [09:09:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1012 after kernel, mariadb and socket location upgrade (duration: 00m 57s) [09:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:10] and Last change to that module was Thu Nov 30 [09:10:48] and it didn't fail anywhere else [09:11:06] !log restarting apache on netmon* to pick up curl security updates [09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:44] (03CR) 10Elukey: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler02/10529/" [puppet] - 10https://gerrit.wikimedia.org/r/420646 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [09:13:32] (03CR) 10Giuseppe Lavagetto: [C: 031] Update mathoid chart to resemble current production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 (owner: 10Alexandros Kosiaris) [09:13:40] _joe_: if you let me, I will drop the dir and let puppet fix it [09:13:47] <_joe_> jynus: go on [09:13:59] not worth looking more into it, as it does look like a client-only problem [09:14:02] <_joe_> it's a pretty strange situation, but yes [09:14:03] not puppet or server [09:14:12] <_joe_> no reason to go deeper into it right now [09:14:57] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:15:23] Notice: /Stage[main]/Role::Labs::Db::Common/Git::Clone[operations/mediawiki-config]/Exec[git_clone_operations/mediawiki-config]/returns: executed successfully [09:15:51] (03CR) 10Giuseppe
Lavagetto: [C: 031] Fix wrongly indented externalIPs field [deployment-charts] - 10https://gerrit.wikimedia.org/r/420341 (owner: 10Alexandros Kosiaris) [09:15:57] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [09:16:17] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:16:48] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:17:17] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.014 second response time [09:17:48] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [09:19:10] (03CR) 10Alexandros Kosiaris: [C: 032] Fix wrongly indented externalIPs field [deployment-charts] - 10https://gerrit.wikimedia.org/r/420341 (owner: 10Alexandros Kosiaris) [09:19:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix wrongly indented externalIPs field [deployment-charts] - 10https://gerrit.wikimedia.org/r/420341 (owner: 10Alexandros Kosiaris) [09:19:42] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for librenms-ircbot [puppet] - 10https://gerrit.wikimedia.org/r/420660 (https://phabricator.wikimedia.org/T135991) [09:24:47] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:43] (03CR) 10Jcrespo: [C: 032] Revert "dbproxy: Depool labsdb1009 for upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/420650 (owner: 10Jcrespo) [09:26:51] (03PS2) 10Jcrespo: Revert "dbproxy: Depool labsdb1009 for upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/420650 [09:28:47] !log repool labsdb1009 after upgrade [09:28:47] (03PS2) 10Muehlenhoff: Add Cumin aliases for ores [puppet] - 10https://gerrit.wikimedia.org/r/420334 [09:28:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:34] (03CR) 10Muehlenhoff: [C: 032] Add Cumin aliases for ores [puppet] - 10https://gerrit.wikimedia.org/r/420334 (owner: 10Muehlenhoff) [09:33:27] (03CR) 10Jcrespo: "This is mostly right, but not 100%, but if it helps merge it now, and I will fix the other issues later." [puppet] - 10https://gerrit.wikimedia.org/r/420653 (owner: 10Muehlenhoff) [09:41:28] (03PS3) 10Giuseppe Lavagetto: mediawiki::hhvm: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/415830 [09:49:20] (03PS2) 10Filippo Giunchedi: puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) [09:50:42] (03PS1) 10Marostegui: db-eqiad.php: Fully repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420666 [09:51:34] (03PS2) 10Jcrespo: Add advisorswiki to $private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [09:52:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420666 (owner: 10Marostegui) [09:53:22] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420666 (owner: 10Marostegui) [09:53:37] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool es1012 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420666 (owner: 10Marostegui) [09:53:59] (03CR) 10Jcrespo: [C: 032] Add advisorswiki to $private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [09:54:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "change is practically a noop" [puppet] - 10https://gerrit.wikimedia.org/r/415830 (owner: 10Giuseppe Lavagetto) [09:54:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool es1012 after kernel, mariadb and socket location upgrade (duration: 00m 58s) [09:54:38] 
(03PS4) 10Giuseppe Lavagetto: mediawiki::hhvm: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/415830 [09:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:25] _joe_: actually, same error on db1095, and I haven't restarted it yet [09:56:34] (git::clone) [09:56:39] <_joe_> jynus: meh [09:57:04] basically on all hosts that clone mediawiki-config [09:58:12] <_joe_> well, minus tin [09:58:20] <_joe_> so it's now worth looking into again [09:58:24] tin doesn't "clone it" [09:58:33] well, at least not by puppet [09:58:43] <_joe_> well, someone must have given git submodule update --init there too [09:59:24] could be a permission error [09:59:40] or a bad handling of the source [10:00:06] <_joe_> I don't think it's a permission error [10:00:12] <_joe_> the thing is cloned as root [10:03:50] and now, puppet-merge is failing [10:04:26] it works locally on puppetmaster1001, but while doing the others, it fails [10:04:48] error: Your local changes to the following files would be overwritten by merge: [10:04:50] manifests/realm.pp [10:05:17] is something going on with puppet masters? [10:05:26] do we have a git issue? [10:06:31] jynus: I'll take a look [10:07:14] jynus: can I run puppet-merge myself to reproduce? [10:07:30] you can, but it will ignore the error [10:07:36] (03CR) 10Alexandros Kosiaris: Update mathoid chart to resemble current production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 (owner: 10Alexandros Kosiaris) [10:07:37] as it was merged on 1001 [10:07:46] but I think not on the others [10:08:09] I can fix things manually by going to each server [10:08:36] but why did it fail in the first place? [10:09:42] no idea, the message seems to suggest the git working directory was somehow unclean [10:10:00] could it have been work on upgrade? 
[10:10:38] giuseppe's merge failed, too [10:10:39] it shouldn't since other puppet-merge worked but it is certainly possible [10:10:54] should I retry from puppet 2001? [10:11:01] to try to sync them? [10:11:39] retry puppet-merge you mean? [10:11:51] run puppet-merge so it catches all that failed [10:12:01] mine and joe's [10:12:59] sure, should work the same, make sure to capture the output for debugging [10:13:34] I think it is 1002 that fails [10:13:52] 1001 goes through, 1002 aborts [10:14:03] <_joe_> jynus: I don't think so [10:14:12] <_joe_> also, please all stop [10:14:26] <_joe_> either merging or doing things with git on puppetmaster1001 [10:14:44] <_joe_> so for some reason git log on puppetmaster1001 [10:14:53] <_joe_> shows my change, but not jynus' one [10:14:55] <_joe_> wtf? [10:15:07] <_joe_> oh no, it was reedy [10:15:31] could be aborting on rhodium.eqiad.wmnet and 1002 is ok [10:16:20] probably locally modified realm.pp [10:16:42] that's more likely yeah, I'll check rhodium [10:16:57] <_joe_> yes [10:17:13] probably locally modified realm.pp [10:17:16] modified: realm.pp [10:17:25] ../modules/wmflib/lib/hiera/backend/nuyaml_backend.rb [10:17:30] <_joe_> (2) puppetmaster2002.codfw.wmnet,rhodium.eqiad.wmnet [10:17:41] <_joe_> are the two servers not synced [10:17:51] yeah, it aborted at rhodium [10:17:56] indeed, I'll take rhodium out for now [10:17:58] so last one doesn't sync [10:18:22] <_joe_> The last Puppet run was at Fri Mar 9 14:57:39 UTC 2018 (15560 minutes ago). Puppet is disabled. pointing /etc/puppet/modules at path with change 410050 (upgraded puppetdbquery) --herron [10:18:26] <_joe_> this is rhodium [10:18:30] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:18:48] <_joe_> 10 days without a puppet run on a puppetmaster is unacceptable [10:19:00] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:19:12] godog: do you know how to remove it from clients and from merge trigger? [10:19:20] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:19:30] <_joe_> those are failing for the failed puppet-merge [10:19:41] hopefully not critical [10:19:49] user-facing? [10:19:49] <_joe_> no the file is already there [10:19:51] ok [10:20:05] I guess only a percentage will fail [10:20:15] actually all of codfw [10:20:27] as I think 1001 and 1002 went through [10:20:36] (03PS1) 10Filippo Giunchedi: hieradata: remove rhodium from puppetmaster workers [puppet] - 10https://gerrit.wikimedia.org/r/420667 [10:20:40] <_joe_> no, 2001 went through as well [10:20:45] ah, cool [10:20:56] the error was not very verbose [10:21:01] I'll merge that so rhodium isn't on puppet-merge anymore [10:21:08] I think 2001 could be me [10:21:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: remove rhodium from puppetmaster workers [puppet] - 10https://gerrit.wikimedia.org/r/420667 (owner: 10Filippo Giunchedi) [10:21:12] when done it manually from there [10:21:14] <_joe_> cumin 'A:puppetmaster' 'git -C /var/lib/git/operations/puppet/ log --oneline | head -n 3' [10:21:20] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:21:22] <_joe_> that's how I verified it [10:22:15] <_joe_> godog: what about we first stash the changes on rhodium?
[10:22:20] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:22:20] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:22:25] <_joe_> so that puppet-merge will work this time [10:22:30] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:22:33] I can do that [10:22:50] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:22:56] could that break something, or is rhodium supposed to be test only? [10:22:57] _joe_: I took rhodium out of puppet-merge manually and ran puppet-merge on puppetmaster1001 instead [10:23:10] ah, that would work, too [10:23:18] so it breaks less subtly [10:23:33] <_joe_> godog: I don't really see a reason for the current status of rhodium btw [10:23:44] yeah, should be ok now [10:23:50] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:23:52] <_joe_> there is a hiera backend patch applied that is now managed differently by you [10:23:56] I will run puppet-merge on 2002 [10:24:03] <_joe_> jynus: why? [10:24:04] as that was behind [10:24:10] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:24:10] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:24:17] <_joe_> jynus: no, if godog successfully runs puppet-merge [10:24:18] jynus: they should all be sync'd now [10:24:20] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:24:21] <_joe_> it will catch up [10:24:26] not until a new commit happens [10:24:53] I think the trigger only works if new changes are seen locally [10:25:09] jynus: yeah my patch to take out rhodium was that trigger [10:25:10] as 1001 sees no changes, it will not update the others [10:25:15] ah, of course [10:25:20] sorry [10:25:29] <_joe_> ok, puppet failures are now solved [10:25:35] so we should be good now [10:26:00] I broke things but at least I detected there was a breakage- better here than with worse patches [10:26:19] <_joe_> you didn't break anything [10:26:26] well, you get what I mean [10:26:41] to be fair, the change shouldn't be on realm.pp in the first place [10:27:28] herron was the person to ping about puppet being disabled/local drift? [10:27:31] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:27:41] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:28:02] well, at least it seems signed by him [10:28:10] will ping him/create a task [10:28:30] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: Puppet has 1 failures.
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/hhvm_cleanup_cache] [10:28:50] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:29:22] I think the local changes to the git repo and puppet disabled were two different scopes, I'm checking [10:29:48] ok, I will create a ticket [10:30:15] add heron in case he knows about it, and anyone else related to puppet work that I know of [10:30:35] is not this? T188544#4038449 [10:30:35] T188544: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544 [10:31:24] I see, I will add a comment there instead [10:31:38] thanks, volans [10:32:07] np, sorry for being late, I was not looking at IRC closely [10:32:09] (03CR) 10Mark Bergsma: [C: 031] Split off attributes and exceptions from bgp.py into their own modules (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [10:34:58] I'll force a puppet run on mw2* failed hosts [10:35:41] 10Operations, 10Puppet, 10Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4011676 (10jcrespo) rhodium is (was after Filippo's change https://gerrit.wikimedia.org/r/420667 ) a production puppet master in a bad state- local ch... [10:35:47] (03CR) 10Vgutierrez: [C: 031] Split off attributes and exceptions from bgp.py into their own modules (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [10:36:15] godog: ^ I have sent a comment commenting your patch so no errors happen unintentionally [10:36:25] *about your patch [10:36:32] jynus: thanks! 
[10:37:40] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:37:50] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:38:21] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:38:30] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:39:00] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:01] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:10] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:20] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:20] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:39:55] (03PS3) 10Muehlenhoff: mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/380712 [10:40:57] (03PS1) 10Alexandros Kosiaris: Clone operations/deployment-charts on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/420668 [10:41:08] (03CR) 10Muehlenhoff: [C: 032] mediawiki::packages::fonts: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/380712 (owner: 10Muehlenhoff) [10:41:20] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:42:20] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:42:20] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:42:31] 
RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:42:31] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:43:20] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:44:06] !log reboot labtestnet2001 for T189722 [10:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:50] !log upgrade and reboot db1102 - this can create tempory lag on wikireplicas [10:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:51] PROBLEM - Host labtestnet2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:40] RECOVERY - Host labtestnet2001 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [10:46:49] (03CR) 10Mark Bergsma: [C: 032] Split off attributes and exceptions from bgp.py into their own modules (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [10:47:20] (03Merged) 10jenkins-bot: Split off attributes and exceptions from bgp.py into their own modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [10:48:19] (03PS2) 10Alexandros Kosiaris: Clone operations/deployment-charts on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/420668 [10:48:29] (03CR) 10Alexandros Kosiaris: [C: 032] Clone operations/deployment-charts on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/420668 (owner: 10Alexandros Kosiaris) [10:48:39] (03PS1) 10Muehlenhoff: mediawiki::packages::fonts: Consistently use require_package [puppet] - 10https://gerrit.wikimedia.org/r/420670 [10:49:16] (03Abandoned) 10Muehlenhoff: Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:50:23] !log reboot again labtestnet2001 for T189722. 
Now with a proper grub menu [10:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:47] 10Operations, 10Puppet, 10Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4064207 (10fgiunchedi) >>! In T188544#4064160, @jcrespo wrote: > rhodium is (was after Filippo's change https://gerrit.wikimedia.org/r/420667 ) a prod... [10:51:10] PROBLEM - Host labtestnet2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:57] 10Operations, 10Traffic, 10Goal, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (10Vgutierrez) varnishxcps metrics are being used in the following dashboards: * db/tls-ciphersuite-explorer (TLS - Ciphersuite Explorer) * db/tls-ciphers (TLS Ciphers... [10:55:21] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 5.939 second response time [10:56:48] 10Operations, 10netops: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064218 (10fgiunchedi) p:05Triage>03Normal [10:58:00] RECOVERY - Host labtestnet2001 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [11:00:30] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:03:53] !log upgrade and reboot db1095 - this can create temp. 
lag on wikireplicas [11:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:00] (03PS1) 10Daimona Eaytoy: Enable AbuseFilter runtime profile on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) [11:04:10] (03PS1) 10Elukey: profile::mariadb::misc::el::database: allow secure labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) [11:05:16] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4064243 (10fgiunchedi) a:05fgiunchedi>03RobH >>! In T190081#4062149, @RobH wrote: > I'd like to get @fgiunchedi's sign off on our racking proposal, since it affects when the new systems will use their 10G i... [11:06:50] (03CR) 10Elukey: "I am always doubtful when using require_package so this comment might be completely useless :)" [puppet] - 10https://gerrit.wikimedia.org/r/420670 (owner: 10Muehlenhoff) [11:07:34] !log reboot labtestnet2002 for T189722 [11:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:20] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 0.000 second response time [11:09:04] (03CR) 10Elukey: "Jaime/Manuel: whenever you have time let me know if this approach is good or not, and if there is a good/suggested way to safely replicate" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:09:41] PROBLEM - Host labtestnet2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:18] arturo: I guess puppet failures on cloud will be normal because of maintenance?
[11:10:26] jynus: T190135 [11:10:27] T190135: toolforge: puppet broken in the fleet due to a font package - https://phabricator.wikimedia.org/T190135 [11:10:38] oh, I see [11:10:43] in any case, known [11:11:31] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:13:18] arturo: o/ - did you check modules/toollabs/manifests/exec_environ.pp:18:5 ? [11:14:00] RECOVERY - Host labtestnet2002 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [11:15:02] spike of 503s seems related to wikimedia.org [11:15:07] already gone [11:15:20] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:15:40] ah right api [11:15:57] and those seems 504s [11:16:34] elukey: I would try simply dropping the package declaration [11:16:42] (03CR) 10Jcrespo: "> I'd like to deploy in labs a production replica" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:17:45] wow there was a massive spike in page views [11:17:55] https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543722301&to=1521544645269 [11:17:56] (03PS1) 10Arturo Borrero Gonzalez: toollabs: exec_eviron: drop fonts-ipafont-gothic package declaration [puppet] - 10https://gerrit.wikimedia.org/r/420676 (https://phabricator.wikimedia.org/T190135) [11:18:20] and no rate-limit kicked it? [11:18:22] *in? 
[11:18:28] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: exec_eviron: drop fonts-ipafont-gothic package declaration [puppet] - 10https://gerrit.wikimedia.org/r/420676 (https://phabricator.wikimedia.org/T190135) (owner: 10Arturo Borrero Gonzalez) [11:18:37] (03CR) 10Jcrespo: "password::misc::scripts exists on labs, it is just with "fake" passwords (obviously not the same than production), but publicly available " [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:19:21] (03PS2) 10Mark Bergsma: Fix Attribute.__eq__ and .__ne__ [debs/pybal] - 10https://gerrit.wikimedia.org/r/420119 [11:19:23] (03PS2) 10Mark Bergsma: Fix MPReachNLRIAttribute AFI_INET construction from tuple [debs/pybal] - 10https://gerrit.wikimedia.org/r/420120 [11:19:27] (03PS1) 10Mark Bergsma: Add new BGP test modules to __init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/420677 [11:19:27] probably a lot of ips used, the rate-limit is per ip [11:19:40] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 8.112 second response time [11:20:27] (03CR) 10Mark Bergsma: [C: 032] Add new BGP test modules to __init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/420677 (owner: 10Mark Bergsma) [11:20:54] (03Merged) 10jenkins-bot: Add new BGP test modules to __init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/420677 (owner: 10Mark Bergsma) [11:22:02] (03CR) 10Elukey: "> > I'd like to deploy in labs a production replica" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:22:44] elukey: maybe popularity from https://youtu.be/iebSSONaNJQ?t=8m17s [11:23:20] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:23:37] aahhah [11:24:16] (03CR) 10Jcrespo: "Ok, then what I understand is that you want to replicate the environment, but not the data, which is private, right?" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:24:37] (03CR) 10Jcrespo: "Or some (fake) data, right?" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:24:39] (03CR) 10Elukey: "> Ok, then what I understand is that you want to replicate the" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:24:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:25:18] elukey: if you tell you want to "replicate a server" to a dba, he will think about mysql replication [11:25:47] 10Operations, 10Traffic, 10Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#4064286 (10ema) [11:25:50] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:51] 10Operations, 10Traffic, 10Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4064284 (10ema) 05Resolved>03Open This [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521510908805&to=1521544427097 | occurred again... 
[11:25:57] jynus: you are completely right, I should've been clearer, will remember next time :) [11:26:50] 10Operations, 10Ops-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4064287 (10mark) Unfortunately we've had an issue with managing access to the admin account for the Google search console for a long time.... [11:27:55] (03PS1) 10Ema: varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) [11:28:38] (03CR) 10Jcrespo: profile::mariadb::misc::el::database: allow secure labs deployments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:28:46] !log run compiler-update-facts [11:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] (03CR) 10Jcrespo: "In fact, I would not touch the class at all, add the non-production passwords to the right repo, and modify the prometheus class to ignore" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:30:31] (03PS2) 10Daimona Eaytoy: Enable AbuseFilter runtime profile on more Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) [11:31:47] (03PS2) 10Ema: varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) [11:32:26] elukey: you may not know labs/private ? 
[11:32:37] it is a non-private private repo [11:33:20] https://phabricator.wikimedia.org/diffusion/LPRI/ [11:33:21] yep I am aware of it yes [11:33:29] so it will just be used [11:33:50] but may be missing some keys, you can check which ones are needed [11:34:03] I normally commit to both, but double check, as normally they are not used [11:34:11] (03PS1) 10Muehlenhoff: Revert "mediawiki::packages::fonts: Remove support for trusty" [puppet] - 10https://gerrit.wikimedia.org/r/420681 [11:34:34] it makes sense yes, there will be no cross-private-hiera calls between labs and prod, completely forgot about it [11:34:44] you may need to parametrize the master in the script? [11:34:45] thanks! Will amend the code review [11:34:53] it is probably hardcoded, I don't know [11:34:55] that is on your side [11:35:06] (03CR) 10Muehlenhoff: [C: 032] Revert "mediawiki::packages::fonts: Remove support for trusty" [puppet] - 10https://gerrit.wikimedia.org/r/420681 (owner: 10Muehlenhoff) [11:35:08] sure sure, the replication and cleaning is in another profile iirc [11:35:11] ok [11:35:12] so it shouldn't be an issue [11:35:27] so the prometheus part is the only thing that assumes production [11:35:38] but I would just add an "if production" check on the class and forget about it [11:35:45] so I can put a guard and forget about it [11:35:49] :) [11:35:53] :) [11:36:01] super, sorry for the PEBKAC! 
[11:36:13] you can check to add it to the cloud prometheus [11:36:20] but I don't know how that is setup [11:36:29] not super important, I think [11:36:31] (03PS3) 10Volans: Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 [11:36:33] (03PS3) 10Volans: Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 [11:36:43] definitely not for the moment :) [11:37:13] (03CR) 10Ema: "PCC looks happy https://puppet-compiler.wmflabs.org/compiler02/10535/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [11:37:45] I was alarmed because at first I read it as "I want to replicate from production to labs" [11:37:55] and that sounded weird [11:38:03] (03CR) 10Volans: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [11:39:25] (03PS2) 10Elukey: profile::mariadb::misc::el::database: ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) [11:39:51] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 6.573 second response time [11:40:43] (03CR) 10Volans: "replies inline" (032 comments) [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 (owner: 10Volans) [11:41:58] (03PS3) 10Volans: Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) [11:42:00] (03PS3) 10Volans: Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) [11:42:15] 10Operations, 10HHVM, 10User-Elukey: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4064319 (10MoritzMuehlenhoff) >>! 
In T189295#4057840, @Joe wrote: > See https://phabricator.wikimedia.org/T86096#2329554 and https://phabricator.wikimedia.org/T86096#2326032... [11:43:00] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:43:56] (03CR) 10Ema: [C: 031] Add missing reserved LVS IPs comments [dns] - 10https://gerrit.wikimedia.org/r/419799 (owner: 10Volans) [11:45:29] (03PS4) 10Jcrespo: dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) [11:45:31] (03PS4) 10Jcrespo: mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) [11:45:33] (03PS5) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [11:45:35] (03PS2) 10Jcrespo: mariadb: Move parsercaches socket location to the default one [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) [11:45:37] (03PS1) 10Jcrespo: mariadb: Do not add host to production prometheus if it is on cloud [puppet] - 10https://gerrit.wikimedia.org/r/420683 (https://phabricator.wikimedia.org/T171203) [11:46:01] (03CR) 10Jcrespo: "I was thinking more like https://gerrit.wikimedia.org/r/420683 and solve it for all classes." 
[puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:46:40] (03PS2) 10Jcrespo: mariadb: Do not add host to production prometheus if it is on cloud [puppet] - 10https://gerrit.wikimedia.org/r/420683 (https://phabricator.wikimedia.org/T171203) [11:51:51] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 4.276 second response time [11:58:01] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:16] 10Operations, 10Cloud-Services: Create Wikitech account for Aryeh Gregor (preexisting SVN login "simetrical") - https://phabricator.wikimedia.org/T189981#4064335 (10Simetrical) 05Open>03Resolved a:03Simetrical Seems to have worked, thanks! [12:02:34] (03CR) 10Reedy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [12:04:07] !log reboot labtestservices2001 for T189722 [12:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:58] PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:47] RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [12:10:47] PROBLEM - Host labtestservices2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:07] PROBLEM - HP RAID on labstore1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:11:14] (03PS2) 10Sbisson: [DO NOT MERGE] Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [12:12:55] (03PS1) 10Mforns: Modify eventlogging purging script to read from YAML whitelist [puppet] - 10https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) [12:12:57] RECOVERY - Host labtestservices2001 is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms [12:13:12] "[12:01] wikibugs Operations, Cloud-Services: Create Wikitech account for Aryeh Gregor (preexisting SVN login "simetrical") -" <-- Wow, is this the return of Aryeh Gregor (!) [12:13:17] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 4.295 second response time [12:16:27] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:18] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 0.566 second response time [12:21:28] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:26] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 0.000 second response time [12:32:46] (03CR) 10Muehlenhoff: Switch debdeploy clients to Python 3 (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 (owner: 10Muehlenhoff) [12:33:02] (03PS5) 10Muehlenhoff: Switch debdeploy clients to Python 3 [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 [12:33:22] !log reboot labtestservices2002 for T189722 [12:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:23] (03PS1) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420687 (https://phabricator.wikimedia.org/T190137) [12:38:53] (03Abandoned) 10Odder: 
Redirect wikiquote.pl to pl.wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [12:40:31] (03Abandoned) 10Odder: Update logo in the footer button image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376150 (https://phabricator.wikimedia.org/T174603) (owner: 10Odder) [12:41:04] (03CR) 10Elukey: "> I was thinking more like https://gerrit.wikimedia.org/r/420683 and" [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [12:41:06] (03Abandoned) 10Elukey: profile::mariadb::misc::el::database: ease labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/420673 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [12:49:09] 10Operations, 10HHVM, 10User-Elukey: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4064421 (10Bawolff) There's a gerrit patch somewhere (T158724) that would allow the script to be stopped/restarted with no ill effect, but it stalled due to disagreement amo... 
[12:49:36] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 0.001 second response time [12:50:13] !log reboot labtestservices2003 for T189722 [12:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] RECOVERY - HP RAID on labstore1007 is OK: OK: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:1:10, 1E:1:11, 1E:1:12 - Controller: OK - Battery/Capacitor: OK [12:53:56] !log applying schema change to wikishared.cx_translations T190133 [12:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:02] T190133: Deploy translation_cx_version schema change to production - https://phabricator.wikimedia.org/T190133 [12:57:11] (03PS1) 10Herron: depool codfw puppetmaster (via dns) [dns] - 10https://gerrit.wikimedia.org/r/420691 (https://phabricator.wikimedia.org/T177253) [12:59:14] jouncebot: refresh [12:59:16] I refreshed my knowledge about deployments. [12:59:28] !log restarting apache on bohrium/piwik to pick up curl security update [12:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:42] jouncebot: next [12:59:42] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1300) [12:59:42] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4064480 (10kchapman) TechCom is declining because the use case is not current. This needs a new owner and use case. There is actually one current use... 
[13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1300). [13:00:04] No GERRIT patches in the queue for this window AFAICS. [13:00:05] (03CR) 10Mark Bergsma: [C: 031] Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [13:00:22] jouncebot: nice [13:00:26] (03PS1) 10Muehlenhoff: Extend Cumin alias for misc-apache [puppet] - 10https://gerrit.wikimedia.org/r/420693 [13:01:52] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin alias for misc-apache [puppet] - 10https://gerrit.wikimedia.org/r/420693 (owner: 10Muehlenhoff) [13:05:19] 10Operations: Review/Merge/Deploy advisors.wikimedia.org apache vhost - https://phabricator.wikimedia.org/T190143#4064494 (10Reedy) [13:05:21] 10Operations: Review/Merge/Deploy advisors.wikimedia.org apache vhost - https://phabricator.wikimedia.org/T190143#4064504 (10Reedy) [13:05:31] 10Operations, 10Wikimedia-Apache-configuration: Review/Merge/Deploy advisors.wikimedia.org apache vhost - https://phabricator.wikimedia.org/T190143#4064494 (10Reedy) [13:16:12] (03PS1) 10Muehlenhoff: Various updates to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/420697 [13:17:16] (03CR) 10Mobrovac: Update mathoid chart to resemble current production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 (owner: 10Alexandros Kosiaris) [13:17:42] (03CR) 10Vgutierrez: [C: 032] Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [13:18:08] (03PS6) 10Vgutierrez: Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 
(https://phabricator.wikimedia.org/T189290) [13:18:58] (03CR) 10Muehlenhoff: [C: 032] Various updates to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/420697 (owner: 10Muehlenhoff) [13:19:00] (03CR) 10Hoo man: [C: 031] "Looks good, but should only be merged after the relevant change (3481569d9) hits testwikidata (will happen during today's train)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420336 (https://phabricator.wikimedia.org/T189776) (owner: 10Lucas Werkmeister (WMDE)) [13:22:03] (03PS2) 10Lucas Werkmeister (WMDE): Disable reading wb_terms search fields on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420336 (https://phabricator.wikimedia.org/T189776) [13:22:22] (03CR) 10Lucas Werkmeister (WMDE): "Thanks, good point – I’ve added Depends-On to the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420336 (https://phabricator.wikimedia.org/T189776) (owner: 10Lucas Werkmeister (WMDE)) [13:29:01] !log depooling codfw puppet masters via dns T177253 [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:07] T177253: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 [13:29:35] (03CR) 10Herron: [C: 032] depool codfw puppetmaster (via dns) [dns] - 10https://gerrit.wikimedia.org/r/420691 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [13:35:41] (03CR) 10Ema: [C: 032] Fix memory leak when discarding labels [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420395 (owner: 10Vgutierrez) [13:35:58] (03PS1) 10Ayounsi: Fix typo (.py.py -> .py) breaking LDAP auth [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/420699 [13:36:15] (03CR) 10Ayounsi: [V: 032 C: 032] Fix typo (.py.py -> .py) breaking LDAP auth [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/420699 (owner: 10Ayounsi) [13:37:49] 10Operations, 10netops: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064592 (10ayounsi) 
a:03ayounsi Typo in the config, confirmed working for me now. https://gerrit.wikimedia.org/r/#/c/420699/ [13:39:55] (03PS1) 10Ema: 5.1.3-1wm5: Fix memory leak when discarding labels [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420701 [13:42:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420702 (https://phabricator.wikimedia.org/T183469) [13:43:55] (03PS1) 10Andrew Bogott: import-wikitech.sh: restart memcached after everything [wikitech-static] - 10https://gerrit.wikimedia.org/r/420703 [13:44:44] (03CR) 10Andrew Bogott: [V: 032 C: 032] import-wikitech.sh: restart memcached after everything [wikitech-static] - 10https://gerrit.wikimedia.org/r/420703 (owner: 10Andrew Bogott) [13:46:27] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064623 (10fgiunchedi) [13:46:30] 10Operations, 10netops: Can't login on netbox - https://phabricator.wikimedia.org/T190134#4064621 (10fgiunchedi) 05Open>03Resolved Confirmed working! Thanks @ayounsi ! 
[13:48:27] jouncebot: next [13:48:27] In 2 hour(s) and 11 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1600) [13:48:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420702 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [13:48:52] (03PS1) 10Muehlenhoff: Update Cumin aliases for WMCS puppet changes [puppet] - 10https://gerrit.wikimedia.org/r/420704 [13:50:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420702 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [13:50:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420702 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [13:52:07] (03PS1) 10Filippo Giunchedi: puppetmaster: lock commits on /srv/private on non-master hosts [puppet] - 10https://gerrit.wikimedia.org/r/420705 (https://phabricator.wikimedia.org/T189891) [13:52:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065, give main traffic to db1106 - T183469 (duration: 00m 58s) [13:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:31] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [13:54:08] (03CR) 10Ema: [C: 032] 5.1.3-1wm5: Fix memory leak when discarding labels [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420701 (owner: 10Ema) [13:54:58] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10536/" [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [13:55:12] (03CR) 10Muehlenhoff: [C: 032] Update Cumin aliases for WMCS puppet changes [puppet] - 10https://gerrit.wikimedia.org/r/420704 (owner: 10Muehlenhoff) 
[13:55:14] 10Operations, 10Ops-Access-Requests: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4064646 (10cooltey) [13:59:27] (03PS2) 10Filippo Giunchedi: puppetmaster: lock commits on /srv/private on non-master hosts [puppet] - 10https://gerrit.wikimedia.org/r/420705 (https://phabricator.wikimedia.org/T189891) [13:59:29] (03CR) 10Marostegui: "I will merge this once I am finished with pc1005 and pc1006" [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:59:30] !log ayounsi@tin Started deploy [netbox/deploy@7e29963]: Fixing netbox typo LDAP [13:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] !log ayounsi@tin Finished deploy [netbox/deploy@7e29963]: Fixing netbox typo LDAP (duration: 00m 28s) [14:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] !log upload varnish_5.1.3-1wm5 to apt.w.o [14:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:04] !log cp3007: upgrade varnish to 5.1.3-1wm5 [14:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:00] (03CR) 10Jcrespo: "'k. May need heartbeat restart." [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:07:11] 10Operations, 10Ops-Access-Requests: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4064646 (10Fjalapeno) Approving this from the management side. 
Let me know if you need anything else [14:08:22] (03PS2) 10Herron: puppetdb_upgrade: point codfw puppet masters to puppetdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/420062 (https://phabricator.wikimedia.org/T177253) [14:08:57] (03CR) 10Herron: [C: 032] puppetdb_upgrade: point codfw puppet masters to puppetdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/420062 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [14:14:06] !log rolling restart of elasticsearch in deployment-prep for new Java update [14:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:01] (03PS1) 10Giuseppe Lavagetto: Add switch to allow building images that match a glob pattern [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/420711 (https://phabricator.wikimedia.org/T186416) [14:15:50] (03CR) 10Herron: puppetmaster: disable puppet-master service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [14:15:52] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/420705 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [14:16:08] (03CR) 10jerkins-bot: [V: 04-1] Add switch to allow building images that match a glob pattern [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/420711 (https://phabricator.wikimedia.org/T186416) (owner: 10Giuseppe Lavagetto) [14:17:52] 10Puppet: Investigate wrong location for /srv/private post-receive hook in puppetmaster::gitclone - https://phabricator.wikimedia.org/T190157#4064782 (10fgiunchedi) [14:19:19] (03PS2) 10Giuseppe Lavagetto: Add switch to allow building images that match a glob pattern [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/420711 (https://phabricator.wikimedia.org/T186416) [14:25:34] (03CR) 10Filippo Giunchedi: [C: 04-2] "There's also T190157 involved in this, so the patch won't do the right thing because the else branch is 
never taken. Putting this on hold " [puppet] - 10https://gerrit.wikimedia.org/r/420705 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [14:27:37] (03CR) 10Filippo Giunchedi: puppetmaster: disable puppet-master service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [14:36:17] (03PS1) 10Andrew Bogott: labtestwikitech db: set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420720 [14:36:38] (03CR) 10jerkins-bot: [V: 04-1] labtestwikitech db: set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420720 (owner: 10Andrew Bogott) [14:37:12] (03PS2) 10Andrew Bogott: labtestwikitech db: set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420720 [14:37:58] (03PS1) 10Filippo Giunchedi: hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) [14:38:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "Staging the patch, do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [14:39:46] (03CR) 10Andrew Bogott: [C: 032] labtestwikitech db: set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420720 (owner: 10Andrew Bogott) [14:40:23] !log Drop empty (confirmed) table slots from s1 - T190153 [14:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:29] T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153 [14:42:48] !log Drop empty (confirmed) table slots from s2 - T190153 [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] 10Operations, 10ops-codfw, 10netops, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4064892 (10Papaul) p:05Normal>03High 
[14:47:48] !log upload scap 3.7.7-1 - T189306 [14:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:53] T189306: SCAP: Upload debian package version 3.7.7-1 - https://phabricator.wikimedia.org/T189306 [14:48:41] (03PS3) 10Filippo Giunchedi: Bump scap package version to 3.7.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/417943 (https://phabricator.wikimedia.org/T189306) (owner: 1020after4) [14:49:13] twentyafterfour: I'll merge and force a puppet run on tin ^ [14:49:20] (03CR) 10Filippo Giunchedi: [C: 032] Bump scap package version to 3.7.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/417943 (https://phabricator.wikimedia.org/T189306) (owner: 1020after4) [14:52:05] godog: thanks [14:52:07] !log Drop empty (confirmed) table slots from s4 - T190153 [14:52:10] 10Operations, 10ops-codfw, 10netops, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4064917 (10Papaul) a:05ayounsi>03RobH [14:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:12] T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153 [14:52:23] twentyafterfour: ok good to go on tin [14:52:53] !log Drop empty (confirmed) table slots from s5 - T190153 [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] !log Drop empty (confirmed) table slots from s8 - T190153 [14:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:46] !log twentyafterfour@tin testing scap on tin [14:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:13] godog: looks good. 
[14:54:47] \o/ [14:55:56] !log Drop empty (confirmed) table slots from s6 - T190153 [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:08] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4064924 (10Anomie) >>! In T66214#4064480, @kchapman wrote: > TechCom is declining because the use case is not current. Everything listed in the task's... [14:56:24] (03PS1) 10Andrew Bogott: labtestwikitech: second attempt to set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420726 [14:57:13] (03CR) 10Andrew Bogott: [C: 032] labtestwikitech: second attempt to set proper $basedir [puppet] - 10https://gerrit.wikimedia.org/r/420726 (owner: 10Andrew Bogott) [14:57:27] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: [Blocked] Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4064932 (10fgiunchedi) [14:58:03] !log Drop empty (confirmed) table slots from s7 - T190153 [14:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:08] T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153 [14:59:57] !log codfw puppet masters upgraded to puppetdb4. 
placing puppet agents into icinga downtime and beginning puppet --noop runs (to send facts to new puppetdb) T177253 [15:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:03] T177253: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 [15:01:27] (03PS3) 10RobH: adding ironholds/oliver keyes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) [15:03:36] (03PS4) 10RobH: adding ironholds/oliver keyes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) [15:04:11] (03CR) 10Volans: [C: 031] "LGTM, I think on new installs we'll also need a systemctl reset-failed but is probably not worth to add it into puppet code for now." [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:06:31] (03CR) 10RobH: [C: 032] adding ironholds/oliver keyes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [15:07:24] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4064956 (10RobH) [15:07:47] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4024896 (10RobH) 05stalled>03Resolved Access has been merged live. [15:08:32] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064962 (10ayounsi) Netbox has been upgraded to 2.3.1 which supports virtual chassis switches. Updating description for next steps. 
[15:08:36] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4064963 (10RobH) This will also need approval in the weekly SRE meeting. I'll list this on next Monday's meeting for approval. [15:09:23] hi all, was directed here from #wikipedia, I'm trying to find out why lines 160-165 were added to robots.txt on March 15th/16th (Disallow: /wiki/Special:) [15:10:06] I cannot find any changelog or talk related to the change, and only the Mar 15 mention of "20:38 demon@tin: Synchronized robots.txt: minor tidying (duration: 00m 58s)" in https://wikitech.wikimedia.org/wiki/Server_Admin_Log, but this change seems like a lot more than tidying [15:10:18] 10Operations, 10Ops-Access-Requests: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4064975 (10RobH) [15:12:49] 10Operations, 10Ops-Access-Requests: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4064978 (10RobH) [15:12:59] (03PS3) 10Filippo Giunchedi: puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) [15:14:51] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [15:14:56] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4064982 (10ayounsi) [15:15:00] 10Operations, 10Ops-Access-Requests: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4064984 (10RobH) a:03cooltey In triaging this request, I found that the user is using the same ssh key in both cloud/labs and in production. Please note the L3 is very clear... 
[15:15:02] (03PS2) 10Filippo Giunchedi: hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) [15:15:08] no_l0gic: I found this change in the config repo: https://gerrit.wikimedia.org/r/411255 [15:15:22] thanks Lucas_WMDE, was just looking at that [15:16:02] (03PS1) 10RobH: user cooltey has same key in cloud and production [puppet] - 10https://gerrit.wikimedia.org/r/420728 (https://phabricator.wikimedia.org/T190150) [15:16:34] (03CR) 10BryanDavis: "Reverted by I2d6a3350540b4c87ebd6832dedf3a58a0c6600d9. See T190135" [puppet] - 10https://gerrit.wikimedia.org/r/380712 (owner: 10Muehlenhoff) [15:16:41] Lucas_WMDE: do you know why this change now, or if there is another way to access markup language for a page [15:16:56] (03CR) 10RobH: [C: 032] user cooltey has same key in cloud and production [puppet] - 10https://gerrit.wikimedia.org/r/420728 (https://phabricator.wikimedia.org/T190150) (owner: 10RobH) [15:17:03] I don’t know anything about this, I just looked through the config log when I saw your question [15:17:29] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:17:44] (03PS4) 10Filippo Giunchedi: puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) [15:17:50] (03CR) 10Filippo Giunchedi: [C: 04-2] hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [15:18:13] Lucas_WMDE: any idea where / who to ask to find out? 
=) [15:18:47] I think this channel would be the best place – perhaps someone else will respond :) [15:19:06] (03PS1) 10Alexandros Kosiaris: prometheus: Add kubernetes pods jobs [puppet] - 10https://gerrit.wikimedia.org/r/420731 (https://phabricator.wikimedia.org/T184923) [15:19:07] This blocks our years-long use of accessing /wiki/Special:Export from what we thought were well-behaved crawlers [15:19:35] I suppose you could open a phabricator ticket as well [15:20:31] Where would I do that? (sorry, new to this side of things) [15:20:46] !log Drop empty (confirmed) table slots from s3 - T190153 [15:20:48] robh: merging your patch as well [15:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:56] T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153 [15:22:54] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4065026 (10RobH) Additionally, you are requesting access to stat1005 and stat1006 for the Wikipedia Android app analytics data. Do you happen to know if th... [15:28:25] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4065078 (10chasemp) @robh just a ping that we are talking about this in our weekly, are you going to have time to check into where to go from here? easy money says maybe we should just connect up... [15:29:52] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4065102 (10bd808) Summary of current state from @chasemp: * Imaging as Jessie works, but our OpenStack deploy is not ready for mixed Trusty/Jessie * Trusty will only image from eth0 * eth0 on the... 
[15:30:29] (03PS1) 10Filippo Giunchedi: Depool eqiad puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/420733 (https://phabricator.wikimedia.org/T184562) [15:31:21] (03PS1) 10Filippo Giunchedi: Move config-master to codfw [dns] - 10https://gerrit.wikimedia.org/r/420734 (https://phabricator.wikimedia.org/T184562) [15:31:29] (03CR) 10jerkins-bot: [V: 04-1] Move config-master to codfw [dns] - 10https://gerrit.wikimedia.org/r/420734 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:32:36] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 82447451 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:33:46] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 21058476 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:33:51] hmm interesting [15:34:31] 14400 seconds ? [15:34:33] what ? [15:34:46] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5827 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:34:55] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:45] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:45] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 3818 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:40:29] 10Operations, 10netops: Add puppetmaster2001 to analytics-in4 - https://phabricator.wikimedia.org/T190167#4065148 (10fgiunchedi) [15:42:13] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065195 (10RobH) [15:42:41] <_joe_> godog: uh you already moved the puppetmaster? [15:42:48] _joe_: no [15:42:59] looking at it now, thinking it’s load from forced puppet runs [15:43:00] <_joe_> so why the alarms? 
[15:43:09] <_joe_> forced puppet runs? [15:43:14] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065212 (10Nuria) ping @bblack could we possibly get this review done next quarter? [15:43:27] via cumin [15:43:34] <_joe_> what concurrency? [15:43:43] 100 [15:43:46] <_joe_> wat? [15:43:58] <_joe_> 30 at most is advisable :P [15:44:06] <_joe_> of course that would kill the puppetmaster [15:44:52] (03CR) 10Chad: [C: 031] scap prep: Scap-ify the creation of beta's StartProfiler.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416334 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [15:45:22] haha ok I’ll lower it, thanks _joe_ [15:45:58] (03CR) 10Chad: [V: 032 C: 032] Gerrit 2.14.7 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/418966 (owner: 10Chad) [15:46:21] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065219 (10BBlack) Yes, we've had this on the discussion list for ops Q4 goals (the elimination of the need for ipsec for caches<->kafka-brokers... 
[15:47:49] (03PS2) 10Filippo Giunchedi: Depool eqiad puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/420733 (https://phabricator.wikimedia.org/T184562) [15:47:51] (03PS2) 10Filippo Giunchedi: Move config-master to codfw [dns] - 10https://gerrit.wikimedia.org/r/420734 (https://phabricator.wikimedia.org/T184562) [15:48:18] godog: we are doing the m1 failover at 4pm utc, something we need to be aware of with those changes ^ cc jynus [15:49:16] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065258 (10ayounsi) a:03faidon [15:49:26] !log demon@tin Started deploy [gerrit/gerrit@09534cb]: gerrit 2.14.7 [15:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:39] !log demon@tin Finished deploy [gerrit/gerrit@09534cb]: gerrit 2.14.7 (duration: 00m 12s) [15:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:48] marostegui: when do you anticipate the failover will be complete? I should hold off on disabling puppet agents until after that I think [15:49:52] 10Operations, 10netops: Add puppetmaster2001 to analytics-in4 - https://phabricator.wikimedia.org/T190167#4065267 (10ayounsi) 05Open>03Resolved a:03ayounsi Done. [15:50:15] <_joe_> yes, it's advisable not to overlap the changes [15:50:19] marostegui: ack, thanks for the heads up! herron is your person of contact for today as you can tell now [15:50:20] herron: if all goes fine, the downtime should be a few seconds and the clean up tasks maybe 10 minutes or so [15:50:31] <_joe_> we need a CAB [15:50:39] <_joe_> to avoid such situations!!1! [15:50:42] marostegui: ok sounds good I’ll keep an eye out for the all clear [15:50:58] herron: ta! 
[15:51:38] we scheduled it yesterday and asked if anyone had a conflict, plus send an email [15:51:38] !log gerrit: restarting services to pick up 2.14.6 -> 2.14.7 upgrade [15:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:09] and warned 1 week ago [15:52:33] should we reschedule? [15:53:23] jynus: don't think so, herron said he'd wait :) [15:53:49] yes, I could use a coffee anyway! [15:55:16] bacula stopped [15:55:22] sorry if this isn't the right way to ask: https://phabricator.wikimedia.org/rOMWCe8123ad4fa2a6c0f1d9b9eb57b6204503ca9879e#2861691 [15:56:14] (03PS5) 10Marostegui: dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [15:57:06] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [15:57:32] known ^ ignore [15:57:58] !log changing replication topology of m1 [15:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:05] (03Draft1) 10Paladox: Gerrit: Fix its-phabricator soy templates [puppet] - 10https://gerrit.wikimedia.org/r/420742 (https://phabricator.wikimedia.org/T188033) [15:58:09] (03PS2) 10Paladox: Gerrit: Fix its-phabricator soy templates [puppet] - 10https://gerrit.wikimedia.org/r/420742 (https://phabricator.wikimedia.org/T188033) [15:58:16] no_justification mutante ^^ [15:59:17] <_joe_> herron: ping [15:59:29] <_joe_> herron: is your forced puppet run still on? [15:59:40] <_joe_> because I can't run puppet in eqiad, basically [15:59:41] _joe_: no I stopped it [15:59:51] <_joe_> well, I'm not sure about that [16:00:03] <_joe_> given 141 hosts have the agent run lockfile atm [16:00:04] godog, moritzm, and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 8 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1600). [16:00:05] marlier: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:05] !log disable puppet on db1063, db1016 [16:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:33] <_joe_> marlier: I think we have to wait a bit [16:00:36] there's a guy editing Wikidata at 350 edits per minute [16:00:37] (03CR) 10Jcrespo: [C: 032] dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [16:00:53] yes, 350 epm [16:00:57] _joe_: hmm I see activity in puppetmaster1001 log [16:01:02] should I... block him or such? [16:01:08] <_joe_> herron: how did you "cancel" it? [16:01:12] revi: ask him to stop, block if it is a bot [16:01:29] https://www.wikidata.org/wiki/User_talk:Renamerr asked to stop few seconds ago [16:01:41] _joe_: Okay, just let me know when you're good to go. [16:01:41] and it's QuickStatements [16:01:49] semi-automated [16:02:14] herron: Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': Failed to open TCP connection to puppet:8140 (Connection refused - connect(2) for "puppet" port 8140) on dbproxy1006 and dbproxy1001 [16:02:18] marostegui: first review merge [16:02:21] <_joe_> marlier: yeah, for now there is one transition and some fires to extinguish [16:02:30] <_joe_> marlier: yeah puppet is broken in eqiad [16:02:36] jynus: yeah, I am running puppet on the proxies [16:02:39] <_joe_> err, marostegui [16:02:40] <_joe_> :P [16:02:40] whoops! 
[16:02:43] Okay [16:02:44] (03CR) 10Chad: [C: 031] Gerrit: Fix its-phabricator soy templates [puppet] - 10https://gerrit.wikimedia.org/r/420742 (https://phabricator.wikimedia.org/T188033) (owner: 10Paladox) [16:02:46] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 2.877 second response time [16:02:46] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 3.726 second response time [16:02:50] I'm blocking him after 15 seconds [16:02:56] !log restarted apache2 on puppetmaster1001 [16:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:01] should I stop the failover- we have not yet reached the moment we cannot go back? [16:03:03] 10Operations, 10ORES, 10Scoring-platform-team (Current): Reboot oresrdb - https://phabricator.wikimedia.org/T189781#4065332 (10Halfak) @akosiaris said he'd write up the incident report. [16:03:04] I'll hold my horses -- if you could mention me (so I get notified) when we're good. [16:03:17] <_joe_> marlier: ok, and sorry for the inconvenience [16:03:25] <_joe_> in the meanwhile, I'll review your patch [16:03:29] jynus: puppet ran on the proxies and configuration looking good [16:03:29] thx [16:03:38] (03PS5) 10Jcrespo: mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) [16:03:45] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [16:04:05] (change visibility) 01:03, 21 March 2018 -revi (A/S) (talk | contribs | block) blocked Renamerr (talk | contribs) with an expiration time of 24 hours (account creation disabled) (Semi-automated editing too fast: 350 edit per minute is completely unacceptable) (unblock | change block) [16:04:24] blocked. 
[16:06:24] revi: you can unblock it once communication happens [16:06:30] yup :D [16:06:34] from my humble point of view [16:07:07] Hi, I blocked you for 24 hours to stop you editing too quickly. Please respect the server, by editing slowly. Thank you. | [16:07:07] I will unblock you as soon as you leave a note here, that you have acknowledged the message. Thanks. [16:07:13] two comments left on their talk page [16:07:26] marostegui: should I go on, akosiaris? [16:07:27] (03CR) 10Giuseppe Lavagetto: "small nitpick but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [16:07:35] jynus: ok from my side! [16:07:39] one thing is that it's 1AM and I'm probably going to be sleeping soon :P [16:07:41] jynus: green light from my side [16:07:47] ok, we will go into read only mode soon [16:07:54] jynus: ping me to reload proxies [16:07:54] will log when done [16:07:59] I will [16:08:14] sorry, revi, in the middle of something important [16:08:28] no worries, my stuff is done :D [16:08:52] what I needed was DBA's advice (whether I should forcibly stop him or not) and I got it [16:08:58] so goodnight, thanks as always! 
[16:09:07] and of course, etherpad breaks :-) [16:09:15] hehe yeah :) [16:09:29] <_joe_> right [16:09:31] (03PS1) 10Filippo Giunchedi: cache: depool puppetmaster1001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/420744 (https://phabricator.wikimedia.org/T184562) [16:09:38] !log heartbeat killed on m1-master [16:09:39] jynus: it's on m1 :-) [16:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:46] but it's working for me right now [16:10:03] I will make a local copy, but cannot assure it will work [16:10:16] works for me right now too [16:10:37] !log set m1 in read only [16:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:52] marostegui: time to kill [16:10:56] ok [16:11:06] we are good! [16:11:11] shall I reload proxy? [16:11:22] server catched up [16:11:25] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065385 (10Volans) [16:11:27] reload and kill [16:11:31] ok! [16:11:41] 1001 done [16:11:46] 1006 done [16:11:48] (03CR) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [16:12:16] new config applied [16:12:27] db1063 showing up as master [16:12:38] cool [16:12:44] did you kill connections? [16:12:45] PROBLEM - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:12:49] ^known [16:12:56] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [16:12:57] on db1016 yes [16:13:15] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: connect to address 10.64.32.177 and port 9001: Connection refused [16:13:15] (03PS1) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [16:13:16] enabling puppet on db1016 [16:13:21] heh and ofc etherpad died [16:13:34] and will put the new host in read write [16:13:43] nothing is connecting to db1016 anymore, so that looks good [16:13:54] !log db1063 in read-write (m1) again [16:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:11] akosiaris: now it is the time to check services and restart them if needed [16:14:14] e.g. etherpad [16:14:19] ok, doing so [16:14:27] etherpad still failing for me [16:14:35] probably needs a ful restart [16:14:45] I will check racktables [16:14:45] RECOVERY - Check systemd state on etherpad1001 is OK: OK - running: The system is fully operational [16:14:47] and now works \o/ [16:14:47] !log restart etherpad T189655 [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:52] T189655: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655 [16:14:56] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [16:15:14] seems ok, but didn't try to write, can be checked later [16:15:15] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8179 bytes in 0.004 second response time [16:15:16] what else? [16:15:23] librenms? 
[16:15:35] rt [16:15:39] robh: ^ [16:15:52] etherpad works for me, I can also create a new page [16:16:06] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [16:16:22] librenms seems ok, but need someone with more experience on it [16:16:31] marostegui: misc db changeover test? [16:16:40] !log restart bacula-dir T189655 [16:16:41] rt works. [16:16:43] racktables is also fine [16:16:43] robh: time to do that, yes [16:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:47] robh: yeah, we failed over, if you could test that'd be cool [16:16:53] specially writes [16:17:00] (03CR) 10Giuseppe Lavagetto: [C: 032] navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [16:17:03] So we never write to RT, EVER ;] but ill make a test task [16:17:11] reads could be fine and writes fail [16:17:16] librenms look ok to me [16:17:16] because conenction handling [16:17:18] ok [16:17:26] ok, it works! can write and read [16:17:30] RT = all good [16:17:33] I will check heartbeat [16:17:57] robh: thanks! [16:18:02] (03PS12) 10Giuseppe Lavagetto: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [16:18:03] does racktables need testing? [16:18:07] i can do that as well, its reading fine [16:18:10] last write on 2018-03-20T16:18:00.000980 [16:18:11] racktables tested [16:18:12] robh: yeah, that'd nice! [16:18:17] :-) [16:18:17] robh: 2 people said ok, but go on [16:18:29] ah, I missed akosiaris tested it! [16:18:30] ok, just confirmed write also works [16:18:33] cool [16:18:36] puppet, the mysql part [16:18:40] and moritzm (thanks!) 
[16:18:40] servermon (puppet db) works as well [16:18:43] any way to check it didn't broke? [16:18:48] ah, that^ [16:18:55] cool [16:18:57] \o/ [16:19:10] what's left ? [16:19:10] (no one noticed the migration so you guys did it right ;) [16:19:22] I think we are done, will keep the maintenace on the title until some minutes happen [16:19:29] ok [16:19:29] e.g. strange bacula backup errors [16:19:29] akosiaris: I was checking the service list, and I think we are finished [16:19:41] but we are officially migrated [16:19:48] cool [16:19:51] \o\ |o| /o/ [16:19:54] <_joe_> marlier: I'll merge/apply, it's unrelated enough from the ongoing maintenance to happen now [16:19:55] only some low priority tasks pending [16:20:19] _joe_: Cool, thanks. Appreciate it! [16:20:27] herron: I think you can go on with your maintenance I think [16:20:47] nice works guys! [16:22:01] thanks to you alex, we will try to unload you from some of those services soon [16:22:20] ? [16:22:44] what services ? I like services, it's other things I mind :P [16:23:05] test backup done successfully [16:23:10] the cobalt one [16:23:20] <_joe_> marlier: does navtiming log anything? [16:23:25] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4065416 (10Marostegui) [16:24:22] <_joe_> oh right, syslog [16:24:50] <_joe_> marlier: we should be ok, I checked the navtiming logs on hafnium [16:25:04] Awesome, thanks! [16:25:09] (03CR) 10Dmaza: [C: 031] Enable AbuseFilter runtime profile on more Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420672 (https://phabricator.wikimedia.org/T175954) (owner: 10Daimona Eaytoy) [16:25:30] I'll keep an eye on it, will know pretty quickly if it's not working. The tests are pretty solid on this component so I'm not too worried about. 
[16:25:46] <_joe_> the change seemed to be safe and simple [16:25:51] (03PS1) 10Andrew Bogott: labtestwikitech: move mysql socket file [puppet] - 10https://gerrit.wikimedia.org/r/420751 [16:26:32] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4065438 (10cooltey) {F15916344} The user group should be `researchers` Thanks! [16:27:35] (03PS2) 10Andrew Bogott: labtestwikitech: use default mysql socket file [puppet] - 10https://gerrit.wikimedia.org/r/420751 [16:27:50] (03PS4) 10Volans: Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 [16:27:52] (03PS4) 10Volans: Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 [16:28:18] (03PS4) 10Volans: Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) [16:28:20] (03PS4) 10Volans: Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) [16:28:55] _joe_: confirmed that everything is working as expected. Thanks for the help! [16:29:11] <_joe_> cool! [16:29:18] (03CR) 10Andrew Bogott: [C: 032] labtestwikitech: use default mysql socket file [puppet] - 10https://gerrit.wikimedia.org/r/420751 (owner: 10Andrew Bogott) [16:31:04] marostegui: great thanks! 
[16:31:43] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#4065473 (10RobH) [16:32:47] (03PS1) 10Andrew Bogott: Wikitech db config: make socket configurable [puppet] - 10https://gerrit.wikimedia.org/r/420756 [16:33:29] (03CR) 10Andrew Bogott: [C: 032] Wikitech db config: make socket configurable [puppet] - 10https://gerrit.wikimedia.org/r/420756 (owner: 10Andrew Bogott) [16:36:48] (03PS1) 10Ema: 5.1.3-1wm6: avoid OH ref leak, use our own extrachance [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420758 [16:38:28] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, just extra file dsh group file now" (031 comment) [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 (owner: 10Volans) [16:38:50] !log running reset slave all on db1063 T189655 [16:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:55] T189655: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655 [16:39:20] (03CR) 10Filippo Giunchedi: [C: 031] Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [16:40:21] (03PS1) 10Andrew Bogott: Update entries for new labtestweb services [dns] - 10https://gerrit.wikimedia.org/r/420760 [16:40:25] (03PS1) 10Andrew Bogott: remove entries for spiceproxy. 
[dns] - 10https://gerrit.wikimedia.org/r/420761 [16:40:37] 10Operations, 10Toolforge, 10cloud-services-team: implement renewed *.tools.wmflabs.org cert/key pair - https://phabricator.wikimedia.org/T190182#4065524 (10RobH) p:05Triage>03Normal [16:40:43] (03PS1) 10RobH: new cert for *.tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/420762 (https://phabricator.wikimedia.org/T190182) [16:40:53] (03CR) 10Andrew Bogott: [C: 032] Update entries for new labtestweb services [dns] - 10https://gerrit.wikimedia.org/r/420760 (owner: 10Andrew Bogott) [16:41:07] (03CR) 10Andrew Bogott: [C: 032] remove entries for spiceproxy. [dns] - 10https://gerrit.wikimedia.org/r/420761 (owner: 10Andrew Bogott) [16:41:39] 10Operations, 10Toolforge, 10cloud-services-team, 10Patch-For-Review: implement renewed *.tools.wmflabs.org cert/key pair - https://phabricator.wikimedia.org/T190182#4065551 (10RobH) a:05RobH>03chasemp [16:43:37] (03PS1) 10Andrew Bogott: labtest: Add misc-web config for labtestwikitech and labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/420763 [16:44:21] (03CR) 10Andrew Bogott: [C: 032] labtest: Add misc-web config for labtestwikitech and labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/420763 (owner: 10Andrew Bogott) [16:46:26] jynus: when you are free, can you write your opinion (I think your opinion will be more authentic (and you know what's best for servers) in https://www.wikidata.org/wiki/User_talk:Renamerr ? I'm unblocking them anyway [16:49:04] 10Operations, 10netops: Netbox: setup backups - https://phabricator.wikimedia.org/T190184#4065562 (10Volans) p:05Triage>03Normal [16:49:05] revi: I will write there, the guidelines, as documented, is to not perform requests in parallel [16:49:09] 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#4065574 (10Volans) [16:49:11] thanks! 
[16:49:21] 400 cannot be done without parallelization [16:52:48] there seemed to be a considerable spike of errors at 16:09 [16:52:54] of misc services [16:53:23] I wonder what could make 10 requests per second etherpad? [16:54:07] (probably, because of javascript) [16:58:38] (03PS6) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [16:58:40] (03PS3) 10Jcrespo: mariadb: Move parsercaches socket location to the default one [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) [16:58:42] (03PS3) 10Jcrespo: mariadb: Do not add host to production prometheus if it is on cloud [puppet] - 10https://gerrit.wikimedia.org/r/420683 (https://phabricator.wikimedia.org/T171203) [16:58:44] (03PS1) 10Jcrespo: mariadb: Update m1 master tag on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/420765 (https://phabricator.wikimedia.org/T189655) [16:59:06] (03PS2) 10Jcrespo: mariadb: Update m1 master tag on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/420765 (https://phabricator.wikimedia.org/T189655) [16:59:41] (03CR) 10Muehlenhoff: "Ok,thanks. I'll merge and we can define in followup patches." [puppet] - 10https://gerrit.wikimedia.org/r/420653 (owner: 10Muehlenhoff) [16:59:46] (03PS2) 10Muehlenhoff: Fix up Cumin db aliases [puppet] - 10https://gerrit.wikimedia.org/r/420653 [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:00:29] (03CR) 10Jcrespo: [C: 032] mariadb: Update m1 master tag on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/420765 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [17:00:53] (03CR) 10Muehlenhoff: [C: 032] Fix up Cumin db aliases [puppet] - 10https://gerrit.wikimedia.org/r/420653 (owner: 10Muehlenhoff) [17:01:01] (03PS3) 10Muehlenhoff: Fix up Cumin db aliases [puppet] - 10https://gerrit.wikimedia.org/r/420653 [17:04:24] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.22 (duration: 02m 57s) [17:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.24 [keeping static files] (duration: 01m 23s) [17:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:40] (03PS1) 10Jcrespo: dblists: Remove dbstore1001 for the list of hosts [software] - 10https://gerrit.wikimedia.org/r/420767 (https://phabricator.wikimedia.org/T186596) [17:07:23] (03CR) 10Jcrespo: [V: 032 C: 032] dblists: Remove dbstore1001 for the list of hosts [software] - 10https://gerrit.wikimedia.org/r/420767 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [17:08:17] (03PS1) 10Giuseppe Lavagetto: Honour .dockerignore when copying the build context [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546) [17:08:24] (03PS5) 10Volans: Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 [17:08:26] (03PS5) 10Volans: Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 [17:08:27] <_joe_> hasharAway: ^^ this is for your request [17:08:39] (03PS1) 10Jcrespo: dblists: Set the new m1 master (db1063) at the botton [software] - 10https://gerrit.wikimedia.org/r/420770 (https://phabricator.wikimedia.org/T189655) [17:08:44] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): implement renewed 
*.tools.wmflabs.org cert/key pair - https://phabricator.wikimedia.org/T190182#4065619 (10bd808) [17:08:46] (03CR) 10jerkins-bot: [V: 04-1] Honour .dockerignore when copying the build context [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/420769 (https://phabricator.wikimedia.org/T183546) (owner: 10Giuseppe Lavagetto) [17:09:08] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4065621 (10ayounsi) [17:09:32] (03CR) 10Jcrespo: [C: 032] dblists: Set the new m1 master (db1063) at the botton [software] - 10https://gerrit.wikimedia.org/r/420770 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [17:10:00] (03CR) 10Volans: "> Patch Set 4: Code-Review+1" (031 comment) [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 (owner: 10Volans) [17:10:34] (03CR) 10Volans: [V: 032 C: 032] Initial import [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 (owner: 10Volans) [17:11:07] (03CR) 10Volans: [V: 032 C: 032] Built artifacts for jessie and stretch [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 (owner: 10Volans) [17:12:15] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4065652 (10jcrespo) [17:12:18] 10Operations, 10DBA, 10Patch-For-Review: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4065649 (10jcrespo) 05Open>03Resolved a:03jcrespo Considered done: ``` == m1 == * Disable GTID on db1063, connect db2078 and db1001 to db1063 DONE * Disable puppet @db1016,... 
[17:12:29] (03PS1) 10Muehlenhoff: Fix Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/420771 [17:13:08] (03CR) 10Volans: "nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420771 (owner: 10Muehlenhoff) [17:13:42] 10Operations, 10netops: Netbox: setup backups - https://phabricator.wikimedia.org/T190184#4065669 (10ayounsi) Correct, plus: /srv/deployment/netbox/deploy/netbox/netbox/media/ and /srv/deployment/netbox/deploy/netbox/netbox/reports/ [17:13:56] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4065670 (10BBlack) [17:14:34] (03PS2) 10Muehlenhoff: Fix Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/420771 [17:15:19] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4065674 (10Marostegui) [17:16:12] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4065679 (10Marostegui) [17:16:32] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4065683 (10Marostegui) [17:19:15] (03PS3) 10Muehlenhoff: Fix Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/420771 [17:20:37] _joe_: you are awesome :] [17:20:47] (03CR) 10Herron: [C: 031] Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:22:41] 10Operations, 10Puppet, 10Patch-For-Review: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4065693 (10fgiunchedi) [17:24:14] (03CR) 10Ayounsi: [C: 031] Enable base::service_auto_restart for librenms-ircbot [puppet] - 10https://gerrit.wikimedia.org/r/420660 (https://phabricator.wikimedia.org/T135991) (owner: 
10Muehlenhoff) [17:24:26] !log mholloway-shell@tin Started deploy [mobileapps/deploy@fad1009]: Update mobileapps to 634a15f [17:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:37] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4065707 (10Cmjohnson) The disks are showing up on the server, I can confirm all 4 disks are being seen during post, they're all green and in the SATA settings they are all on with the... [17:26:30] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): implement renewed *.tools.wmflabs.org cert/key pair - https://phabricator.wikimedia.org/T190182#4065716 (10chasemp) a:05chasemp>03aborrero @madhuvishy graciously said she would walk @aborrero through the process from last ye... [17:29:49] !log test a depool/repool action for kafka1001 (eventbus/jobqueue) - part of an investigation to figure out where timeouts come from [17:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:00] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@fad1009]: Update mobileapps to 634a15f (duration: 05m 34s) [17:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:54] (03CR) 10BBlack: [C: 031] varnishospital: send origin servers health logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [17:34:43] ema: 10+ for the name [17:35:25] (03PS3) 10Ema: varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) [17:35:46] elukey: I also did initially call the dstat plugin for varnish hitrate --varnishit [17:36:17] then I thought the name could have been misinterpreted [17:38:48] (and went for --varnish-hit instead) [17:38:59] hahahahah [17:40:58] slowclap.gif [17:43:13] 
(03CR) 10Volans: varnishospital: send origin servers health logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [17:47:13] (03CR) 10Ema: varnishospital: send origin servers health logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [17:47:14] (03PS5) 10Volans: Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) [17:48:21] (03CR) 10Volans: [C: 032] Puppetboard: add role, profile and configuration [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [17:53:15] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 43380383 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:16] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6041 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:58:15] PROBLEM - Keyholder SSH agent on naos is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [17:58:20] (03PS1) 10Andrew Bogott: labtestwiki: use instanct commons, just like labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420788 [17:59:22] (03PS2) 10Andrew Bogott: labtestwiki: use instant commons, just like labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420788 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:16] RECOVERY - Keyholder SSH agent on naos is OK: OK: Keyholder is armed with all configured keys. 
[18:00:16] that was me, armed (naos keyholder) [18:01:54] (03PS1) 10Chad: Group0 to wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420791 [18:01:56] (03CR) 10Chad: [C: 04-2] Group0 to wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420791 (owner: 10Chad) [18:06:31] (03CR) 10Chad: [C: 032] labtestwiki: use instant commons, just like labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420788 (owner: 10Andrew Bogott) [18:07:46] (03CR) 10Ema: [C: 032] varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [18:07:55] (03PS4) 10Ema: varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) [18:07:59] (03Merged) 10jenkins-bot: labtestwiki: use instant commons, just like labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420788 (owner: 10Andrew Bogott) [18:08:01] (03CR) 10Ema: [V: 032 C: 032] varnishospital: send origin servers health logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/420680 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [18:09:21] (03CR) 10jenkins-bot: labtestwiki: use instant commons, just like labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420788 (owner: 10Andrew Bogott) [18:10:36] !log demon@tin Synchronized wmf-config/CommonSettings.php: instantcommons for labstestwiki (duration: 01m 58s) [18:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:03] andrewbogott: You're live ^^^ [18:12:09] thanks! 
[18:12:21] !log demon@tin Started scap: bootstrap wmf.26 [18:12:24] yw [18:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:25] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [18:12:25] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [18:15:25] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:15:26] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:15:56] PROBLEM - puppet last run on puppetboard2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:16:40] the last one is me, fixing [18:16:45] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:17:18] urandom: --^ [18:17:56] restbase unit failed, puppet disabled, so probably some tests [18:18:16] elukey: yeah, sorry, i thought that was still under maintenance [18:18:36] elukey: aka, not production, on it, sorry for the noise :) [18:18:42] !log eevans@tin Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) [18:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:48] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:19:45] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [18:19:46] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for librenms-ircbot [puppet] - 10https://gerrit.wikimedia.org/r/420660 (https://phabricator.wikimedia.org/T135991) [18:19:46] bblack ema just to double-check, there's currently no easy way for Varnish to trigger an eventlogging event, correct? [18:19:52] thanks in advance! [18:20:15] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Deploying... [18:20:15] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Deploying... [18:20:55] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for librenms-ircbot [puppet] - 10https://gerrit.wikimedia.org/r/420660 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:21:20] AndyRussG: not that I know of [18:21:52] AndyRussG: but this sounds like the making of an X-Y problem :) [18:22:42] AndyRussG: (in other words - what do you really want that eventlogging-from-varnish is suspected to help with?) 
[18:23:04] (03CR) 10Ema: [C: 032] 5.1.3-1wm6: avoid OH ref leak, use our own extrachance [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420758 (owner: 10Ema) [18:24:40] !log eevans@tin Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 05m 58s) [18:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:46] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:25:16] (03PS1) 10Volans: Puppetboard: add missing call to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/420798 (https://phabricator.wikimedia.org/T184563) [18:26:53] (03PS3) 10Dzahn: network::constants: add bast1002 as bastion server [puppet] - 10https://gerrit.wikimedia.org/r/420621 (https://phabricator.wikimedia.org/T183412) [18:27:26] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:27:53] !log eevans@tin Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) [18:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:24] (03PS1) 10Catrope: Disable Flow on tswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420799 (https://phabricator.wikimedia.org/T188815) [18:29:45] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:30:06] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [18:30:06] (03PS1) 10MaxSem: Deploy GlobalPreferences to test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) [18:30:26] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [18:30:26] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [18:30:35] !log eevans@tin Finished deploy 
[restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 02m 43s) [18:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:41] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:30:49] (03CR) 10RobH: [C: 031] "this matches both dns and the host's ips, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/420621 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [18:31:53] (03CR) 10Dzahn: [C: 032] network::constants: add bast1002 as bastion server [puppet] - 10https://gerrit.wikimedia.org/r/420621 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [18:33:28] !log varnish 5.1.3-1wm6 uploaded to apt.w.o [18:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:15] (03PS1) 10RobH: user ctooley provided a dedicated ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/420803 (https://phabricator.wikimedia.org/T190150) [18:34:47] (03PS2) 10RobH: user ctooley provided a dedicated ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/420803 (https://phabricator.wikimedia.org/T190150) [18:35:21] (03CR) 10RobH: [C: 032] user ctooley provided a dedicated ssh pubkey [puppet] - 10https://gerrit.wikimedia.org/r/420803 (https://phabricator.wikimedia.org/T190150) (owner: 10RobH) [18:35:35] (03PS1) 10Papaul: DHCP: Add MAC address entries for mw2259-mw2290 [puppet] - 10https://gerrit.wikimedia.org/r/420804 [18:35:56] (03PS1) 10Andrew Bogott: Post-silver cleanups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) [18:36:12] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entries for mw2259-mw2290 [puppet] - 10https://gerrit.wikimedia.org/r/420804 (owner: 10Papaul) [18:36:38] (03PS3) 10Paladox: Gerrit: Fix its-phabricator soy templates [puppet] - 10https://gerrit.wikimedia.org/r/420742 (https://phabricator.wikimedia.org/T188033) [18:36:46] (03PS21) 
10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [18:37:29] (03PS2) 10Andrew Bogott: Post-silver cleanups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) [18:38:12] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4066198 (10RobH) a:05cooltey>03None I've restored @cooltey's shell access to what was there before I revoked it (due to shared key between production an... [18:38:43] (03CR) 10jerkins-bot: [V: 04-1] Post-silver cleanups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:39:41] (03PS2) 10Dzahn: DHCP: Add MAC address entries for mw2259-mw2290 [puppet] - 10https://gerrit.wikimedia.org/r/420804 (https://phabricator.wikimedia.org/T188301) (owner: 10Papaul) [18:39:51] (03PS1) 10Zoranzoki21: Add new throttle rule and add task for one in comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420807 (https://phabricator.wikimedia.org/T190206) [18:39:56] (03PS3) 10Andrew Bogott: Post-silver cleanups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) [18:40:20] (03PS1) 10RobH: adding cooltey to reserachers [puppet] - 10https://gerrit.wikimedia.org/r/420809 (https://phabricator.wikimedia.org/T190150) [18:41:18] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Access request to stat1005 and stat1006 for cooltey - https://phabricator.wikimedia.org/T190150#4066214 (10RobH) p:05Triage>03Normal [18:42:00] (03PS2) 10Volans: Puppetboard: add missing call to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/420798 (https://phabricator.wikimedia.org/T184563) [18:42:36] bblack: basically it's just that the current plan involves sending an EL event from the client, 
containing information already available on the server. So at least performance-wise, for both client and cluster, it seems wasteful [18:42:46] but if it's still the best way, that's cool! not a huge deal [18:42:54] (03PS2) 10Dzahn: bastionhost: sync /home from bast1001 to bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420626 (https://phabricator.wikimedia.org/T183412) [18:43:01] bblack: https://phabricator.wikimedia.org/T185933#4056789 [18:43:04] thx!!!! [18:43:05] (03Abandoned) 10Andrew Bogott: fix up svg retrieval [wikitech-static] - 10https://gerrit.wikimedia.org/r/420027 (owner: 10ArielGlenn) [18:43:12] (03CR) 10Volans: [C: 032] "compiler results: https://puppet-compiler.wmflabs.org/compiler02/10545/puppetboard1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420798 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [18:43:44] (03PS1) 10Muehlenhoff: Validate SSH keys in account cross check [puppet] - 10https://gerrit.wikimedia.org/r/420810 (https://phabricator.wikimedia.org/T189890) [18:46:13] (03PS3) 10Dzahn: DHCP: Add MAC address entries for mw2259-mw2290 [puppet] - 10https://gerrit.wikimedia.org/r/420804 (https://phabricator.wikimedia.org/T188301) (owner: 10Papaul) [18:46:18] (03CR) 10Zoranzoki21: "Can I abandon this because is task closed as resolved?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [18:47:09] (03CR) 10Dzahn: [C: 032] "lgtm, also checked random samples and MAC address matching first embedded nic" [puppet] - 10https://gerrit.wikimedia.org/r/420804 (https://phabricator.wikimedia.org/T188301) (owner: 10Papaul) [18:48:55] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 52.18, 23.86, 15.16 [18:49:55] mw1227 still accessible and running, loading going down [18:50:49] (03PS3) 10Dzahn: bastionhost: sync /home from bast1001 to bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420626 (https://phabricator.wikimedia.org/T183412) [18:50:55] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 18.02, 21.62, 15.50 [18:51:28] (03CR) 10Dzahn: [C: 032] bastionhost: sync /home from bast1001 to bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/420626 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn) [18:54:38] !log demon@tin Finished scap: bootstrap wmf.26 (duration: 42m 16s) [18:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:00:29] Only one *eyeroll* [19:04:00] PROBLEM - Check size of conntrack table on puppetboard2001 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [19:05:10] 10Operations, 10Cloud-Services, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4066291 (10RobH) [19:07:20] PROBLEM - Check whether ferm is active by checking the default input chain on puppetboard2001 is CRITICAL: NRPE: Command check_ferm_active not defined [19:12:45] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4066328 (10RobH) [19:15:04] (03PS1) 10RobH: decom db1009 [puppet] - 10https://gerrit.wikimedia.org/r/420821 (https://phabricator.wikimedia.org/T189216) [19:15:21] (03CR) 10RobH: [C: 032] decom db1009 [puppet] - 10https://gerrit.wikimedia.org/r/420821 (https://phabricator.wikimedia.org/T189216) (owner: 10RobH) [19:15:33] (03CR) 10Chad: [C: 032] Group0 to wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420791 (owner: 10Chad) [19:17:00] (03Merged) 10jenkins-bot: Group0 to wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420791 (owner: 10Chad) [19:19:21] (03PS1) 10RobH: decom db1009 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/420822 (https://phabricator.wikimedia.org/T189216) [19:20:15] (03CR) 10jenkins-bot: Group0 to wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420791 (owner: 10Chad) [19:20:43] (03CR) 10RobH: [C: 032] decom db1009 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/420822 (https://phabricator.wikimedia.org/T189216) (owner: 10RobH) [19:23:23] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4066353 (10RobH) [19:23:46] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10RobH) a:05RobH>03Cmjohnson ready for 
onsite disk wipe and completion of steps [19:24:10] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - kubelet_operational_latencies is 69262 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:25:23] (03PS1) 10Ema: varnishospital: include layer information [puppet] - 10https://gerrit.wikimedia.org/r/420823 [19:28:45] !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.26 [19:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:20] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - kubelet_operational_latencies is 2427 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:29:22] (03CR) 10BBlack: [C: 031] varnishospital: include layer information [puppet] - 10https://gerrit.wikimedia.org/r/420823 (owner: 10Ema) [19:32:09] (03CR) 10Ema: [C: 032] varnishospital: include layer information [puppet] - 10https://gerrit.wikimedia.org/r/420823 (owner: 10Ema) [19:33:34] (03PS1) 10Herron: depool eqiad puppetmaster and point puppet service records to codfw [dns] - 10https://gerrit.wikimedia.org/r/420827 (https://phabricator.wikimedia.org/T177253) [19:34:38] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4066407 (10Dzahn) @Robh I'm thinking next we should just try repeating the install one more time and see if grub install still fails and if it still does we can try with jessie instead... 
[19:36:51] !log temporarily disabling puppet agents for puppetdb upgrade [19:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:21] (03CR) 10Herron: [C: 032] depool eqiad puppetmaster and point puppet service records to codfw [dns] - 10https://gerrit.wikimedia.org/r/420827 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [19:41:39] (03PS9) 10Herron: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [19:42:20] (03CR) 10jerkins-bot: [V: 04-1] puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [19:42:57] (03CR) 10Herron: [V: 032 C: 032] puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [19:47:27] (03PS14) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:48:05] (03CR) 10jerkins-bot: [V: 04-1] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:53:08] (03PS1) 10Phedenskog: Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 [19:53:40] (03CR) 10jerkins-bot: [V: 04-1] Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (owner: 10Phedenskog) [19:53:56] (03PS15) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:54:24] (03CR) 10jerkins-bot: [V: 04-1] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:55:01] PROBLEM - HTTP on labstore1007 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [19:55:51] (03PS1) 10Vgutierrez: prometheus: aggregate varnish_x_cache metrics [puppet] - 10https://gerrit.wikimedia.org/r/420832 (https://phabricator.wikimedia.org/T184942) [19:56:14] (03PS2) 10Phedenskog: Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) [19:56:42] (03CR) 10jerkins-bot: [V: 04-1] Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [19:58:48] (03PS3) 10Phedenskog: Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) [19:59:00] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 4.963 second response time [20:01:13] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4066449 (10Vgutierrez) varnishxcache metrics (varnish.xcache.*) are being used in the dashboard db/varnish-caching (Varnish Caching). Migration to Promet... 
[20:02:10] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:39] (03PS16) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [20:05:11] (03CR) 10jerkins-bot: [V: 04-1] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:05:15] (03CR) 10Dzahn: [C: 032] Gerrit: Fix its-phabricator soy templates [puppet] - 10https://gerrit.wikimedia.org/r/420742 (https://phabricator.wikimedia.org/T188033) (owner: 10Paladox) [20:06:09] thanks :) [20:06:10] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 7.628 second response time [20:07:52] (03PS17) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [20:07:54] (03CR) 10Lucas Werkmeister (WMDE): "Train has advanced, apparently without problems so far… how do you want this to be deployed, btw? SWAT?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/420336 (https://phabricator.wikimedia.org/T189776) (owner: 10Lucas Werkmeister (WMDE)) [20:15:45] (03PS1) 10Cmjohnson: Removing mgmt dns wtp1001-1024 [dns] - 10https://gerrit.wikimedia.org/r/420838 (https://phabricator.wikimedia.org/T177374) [20:16:11] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:08] (03PS2) 10Cmjohnson: Removing mgmt dns wtp1001-1024 [dns] - 10https://gerrit.wikimedia.org/r/420838 (https://phabricator.wikimedia.org/T177374) [20:18:38] (03Abandoned) 10Cmjohnson: Removing mgmt dns wtp1001-1024 [dns] - 10https://gerrit.wikimedia.org/r/416707 (https://phabricator.wikimedia.org/T177374) (owner: 10Cmjohnson) [20:19:11] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 9.569 second response time [20:19:28] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns wtp1001-1024 [dns] - 10https://gerrit.wikimedia.org/r/420838 (https://phabricator.wikimedia.org/T177374) (owner: 10Cmjohnson) [20:22:11] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:27] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4066537 (10Papaul) [20:24:11] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 5.520 second response time [20:24:16] 10Operations, 10ops-codfw, 10netops, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4066539 (10RobH) Done with b3 > asw-b3-codfw > mw2259 ge-3/0/8 > mw2260 ge-3/0/9 > mw2261 ge-3/0/10 > mw2262 ge-3/0/11 > mw2263 ge-3/0/12 > mw2264 ge-3/0/13 > mw22... 
[20:28:20] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:56] !log OS install on mw2259-mw2290 [20:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:11] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 2.771 second response time [20:32:21] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:54] (03CR) 10Krinkle: [C: 04-1] Performance: Collect Navigation Timing gaps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [20:37:42] (03CR) 10Niharika29: "Why just test wikis and mw.org?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [20:40:15] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4066622 (10RobH) [20:40:18] 10Operations, 10ops-codfw, 10netops, 10User-Elukey: Switch port configuration for mw2259-mw2290 - https://phabricator.wikimedia.org/T190115#4066620 (10RobH) 05Open>03Resolved The remainder are now done as well. ``` [edit interfaces interface-range vlan-private1-d-codfw] member-range ge-1/0/0 { ..... [20:42:31] !log volans@tin Started deploy [puppetboard/deploy@0975558]: Initial sync [20:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:50] !log volans@tin Finished deploy [puppetboard/deploy@0975558]: Initial sync (duration: 02m 19s) [20:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:22] (03CR) 10MaxSem: "Because we need to test first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [20:46:53] (03CR) 10Niharika29: "Are you going to create another patch for deploying everywhere then?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [20:47:31] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 1.189 second response time [20:47:35] (03CR) 10MaxSem: "Yes, but not today. Going everywhere at once is just out of the question." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [20:51:44] (03PS4) 10Phedenskog: Performance: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) [20:52:40] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:15] (03PS1) 10Andrew Bogott: labtestweb: override keystone url for striker [puppet] - 10https://gerrit.wikimedia.org/r/420848 [20:54:10] (03CR) 10Andrew Bogott: [C: 032] labtestweb: override keystone url for striker [puppet] - 10https://gerrit.wikimedia.org/r/420848 (owner: 10Andrew Bogott) [20:55:41] !log volans@tin Started deploy [puppetboard/deploy@0975558]: Initial sync [20:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:51] !log volans@tin Finished deploy [puppetboard/deploy@0975558]: Initial sync (duration: 00m 09s) [20:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:30] RECOVERY - Check size of conntrack table on puppetboard2001 is OK: OK: nf_conntrack is 0 % full [20:57:40] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 1.289 second response time [20:57:41] RECOVERY - Check whether ferm is active by checking the default input chain on puppetboard2001 is OK: OK ferm input default policy is set [21:01:00] RECOVERY - puppet last run on puppetboard2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:01:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 
10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#4066711 (10Cmjohnson) [21:01:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10Patch-For-Review: decom wtp1001-wtp1024 - https://phabricator.wikimedia.org/T177374#3656431 (10Cmjohnson) 05Open>03Resolved [21:02:20] !log volans@tin Started deploy [puppetboard/deploy@0975558]: Initial sync [21:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:46] !log volans@tin Finished deploy [puppetboard/deploy@0975558]: Initial sync (duration: 00m 26s) [21:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:01] PROBLEM - Check systemd state on puppetboard1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:04:11] (03CR) 10Bstorm: [V: 032 C: 031] Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: 10Chico Venancio) [21:04:42] (03CR) 10Bstorm: [V: 032 C: 032] Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: 10Chico Venancio) [21:05:28] (03PS2) 10Cmjohnson: Adding dns entries wdqs1006-8 [dns] - 10https://gerrit.wikimedia.org/r/417352 (https://phabricator.wikimedia.org/T188432) [21:06:50] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:50] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 8.721 second response time [21:10:50] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:11:50] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 5.619 second response time [21:13:25] (03CR) 10Jforrester: Post-silver cleanups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 
(https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:14:06] (03PS5) 10Krinkle: webperf: Collect Navigation Timing gaps [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [21:14:14] I switched to a new ssh key in the gerrit web interface (https://gerrit.wikimedia.org/r/#/settings/ssh-keys), but when I try to git clone from gerrit.wikimedia.org, it's still trying to use my old keys for some reason. Do I need to do something else? [21:14:22] (03CR) 10Krinkle: [C: 031] "Should be safe to roll out anytime. Is conditional and back/future compat." [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [21:14:51] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:59] kaldari: your local .ssh/config probably [21:16:04] 10Operations: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4066759 (10Cmjohnson) [21:16:18] 10Operations, 10ops-eqiad: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4066771 (10Cmjohnson) [21:16:50] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 373 bytes in 0.519 second response time [21:16:59] volans, anything wrong here: [21:17:02] https://www.irccloud.com/pastebin/XGG2F5RV/ [21:17:18] (03CR) 10RobH: [C: 031] Adding dns entries wdqs1006-8 [dns] - 10https://gerrit.wikimedia.org/r/417352 (https://phabricator.wikimedia.org/T188432) (owner: 10Cmjohnson) [21:17:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4066774 (10Cmjohnson) [21:17:23] 10Operations, 10ops-eqiad: Decommission unused host wmf3565 - https://phabricator.wikimedia.org/T190225#4066759 (10Cmjohnson) [21:17:43] (03CR) 10Cmjohnson: [C: 032] Adding dns entries wdqs1006-8 [dns] - 
10https://gerrit.wikimedia.org/r/417352 (https://phabricator.wikimedia.org/T188432) (owner: 10Cmjohnson) [21:18:03] (03CR) 10Andrew Bogott: Post-silver cleanups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:18:25] oh prolly need to add !gerrit.wikimedia.org !git-ssh.wikimedia.org [21:18:33] kaldari: depends if your new key is ~/.ssh/id_rsa or not ;) also you might have a separate stanza for gerrit, as it should use the cloud key and not the prod one [21:19:21] thanks [21:20:51] (03PS1) 10Dzahn: site: add mw2259 thru mw2290 with role spare [puppet] - 10https://gerrit.wikimedia.org/r/420905 (https://phabricator.wikimedia.org/T188301) [21:21:21] volans: it's working now that I added !gerrit.wikimedia.org !git-ssh.wikimedia.org to the default *.wikimedia.org stanza. Thanks again! [21:21:43] yw [21:23:37] (03PS1) 10Volans: Puppetboard: add make, required for its deployment [puppet] - 10https://gerrit.wikimedia.org/r/420906 (https://phabricator.wikimedia.org/T184563) [21:25:42] (03CR) 10Volans: [C: 032] Puppetboard: add make, required for its deployment [puppet] - 10https://gerrit.wikimedia.org/r/420906 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [21:26:47] 10Operations, 10ops-eqiad, 10Packaging, 10hardware-requests: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957#4066807 (10Krinkle) [21:26:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#4066808 (10Krinkle) [21:26:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4066806 (10Krinkle) [21:27:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: wipe/reclaim iodine - https://phabricator.wikimedia.org/T126483#2015562 (10Krinkle) [21:27:38] 10Operations, 10ops-eqiad, 10DC-Ops, 
10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#3976323 (10Krinkle) [21:27:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: wipe/reclaim iodine - https://phabricator.wikimedia.org/T126483#2015562 (10Krinkle) [21:29:03] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4066814 (10Papaul) [21:35:00] 10Operations, 10Cloud-Services, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4066835 (10Andrew) [21:41:36] (03PS1) 10Andrew Bogott: Horizon: remove singular horizon_host hiera setting and all uses [puppet] - 10https://gerrit.wikimedia.org/r/420908 (https://phabricator.wikimedia.org/T168470) [21:43:36] (03CR) 10Krinkle: [C: 031] Post-silver cleanups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:44:22] (03PS2) 10Dzahn: site: add mw2259-mw2290 with role spare, per row [puppet] - 10https://gerrit.wikimedia.org/r/420905 (https://phabricator.wikimedia.org/T188301) [21:46:44] (03CR) 10Andrew Bogott: "This needs a close reading in the puppet compiler to make sure it's not breaking firewall things." 
[puppet] - 10https://gerrit.wikimedia.org/r/420908 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:48:42] (03PS1) 10Imarlier: config: Enable testwiki NavTiming oversample for a bunch more countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420909 (https://phabricator.wikimedia.org/T190229) [21:49:42] (03PS3) 10Dzahn: site: add mw2259-mw2290 with role spare, per row [puppet] - 10https://gerrit.wikimedia.org/r/420905 (https://phabricator.wikimedia.org/T188301) [21:50:35] (03CR) 10Dzahn: [C: 032] site: add mw2259-mw2290 with role spare, per row [puppet] - 10https://gerrit.wikimedia.org/r/420905 (https://phabricator.wikimedia.org/T188301) (owner: 10Dzahn) [21:52:23] (03PS1) 10Imarlier: config: enable NavTiming oversample in a bunch of countries as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420910 (https://phabricator.wikimedia.org/T190229) [21:56:02] (03CR) 10Jforrester: Post-silver cleanups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:59:40] (03CR) 10Alex Monk: Post-silver cleanups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420805 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [22:00:04] MaxSem: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Deploy GlobalPreferences . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T2200). [22:00:04] No GERRIT patches in the queue for this window AFAICS. 
[22:02:53] (03PS1) 10Dzahn: add deployment_server role to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420912 (https://phabricator.wikimedia.org/T175288) [22:04:33] 10Operations, 10Traffic, 10netops: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4066932 (10ayounsi) [22:09:25] (03PS1) 10Dzahn: switch deployment server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/420914 (https://phabricator.wikimedia.org/T175288) [22:11:06] (03PS2) 10MaxSem: Deploy GlobalPreferences to test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) [22:11:11] (03CR) 10MaxSem: [C: 032] Deploy GlobalPreferences to test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [22:12:33] (03Merged) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [22:12:48] (03CR) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420801 (https://phabricator.wikimedia.org/T184121) (owner: 10MaxSem) [22:14:50] !log maxsem@tin Started scap: Test deployment of GlobalPreferences [22:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:50] (03PS1) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [22:16:36] (03CR) 10Krinkle: [C: 031] config: Enable testwiki NavTiming oversample for a bunch more countries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420909 (https://phabricator.wikimedia.org/T190229) (owner: 10Imarlier) [22:16:53] (03CR) 10Krinkle: [C: 031] config: enable NavTiming oversample in a bunch of countries as default [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/420910 (https://phabricator.wikimedia.org/T190229) (owner: 10Imarlier) [22:17:44] 10Operations, 10ops-eqiad: install conf1004-6 ssd upgrades - https://phabricator.wikimedia.org/T190230#4066951 (10RobH) p:05Triage>03Normal [22:22:38] (03PS1) 10Dzahn: network::constants: add deploy1001 as deployment server [puppet] - 10https://gerrit.wikimedia.org/r/420919 (https://phabricator.wikimedia.org/T175288) [22:25:46] (03PS1) 10Dzahn: tcpircbot: add deploy1001 as allowed host [puppet] - 10https://gerrit.wikimedia.org/r/420920 (https://phabricator.wikimedia.org/T175288) [22:31:00] (03PS1) 10MaxSem: Revert "Deploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420921 [22:31:15] ^ that's for the future [22:34:40] (03PS22) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [22:35:40] (03PS9) 10Paladox: ircecho: Support auth over irc [puppet] - 10https://gerrit.wikimedia.org/r/405594 [22:37:26] (03CR) 10Volans: [C: 032] tcpircbot: add deploy1001 as allowed host [puppet] - 10https://gerrit.wikimedia.org/r/420920 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [22:40:00] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:40:25] what now... 
checking [22:41:19] !log volans@tin Started deploy [puppetboard/deploy@0975558]: Initial sync [22:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:26] !log volans@tin Finished deploy [puppetboard/deploy@0975558]: Initial sync (duration: 00m 07s) [22:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:54] (03PS1) 10Paladox: Revert "Gerrit: Fix its-phabricator soy templates" [puppet] - 10https://gerrit.wikimedia.org/r/420922 (https://phabricator.wikimedia.org/T188033) [22:43:58] (03Abandoned) 10Paladox: Revert "Gerrit: Fix its-phabricator soy templates" [puppet] - 10https://gerrit.wikimedia.org/r/420922 (https://phabricator.wikimedia.org/T188033) (owner: 10Paladox) [22:44:52] wondering could someone open up https://phabricator.wikimedia.org/T188033 as it is now resolved please? [22:45:00] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:45:23] paladox: opened. why [22:45:43] (03PS2) 10Ayounsi: Add mr1-eqsin <-> cr1-eqsin IPv6 [dns] - 10https://gerrit.wikimedia.org/r/416882 [22:45:45] mutante not sure, did you change the status of the task without realising? [22:46:10] paladox: no, you closed it along with your comment "this works now" [22:46:14] yep [22:46:37] ? [22:47:04] but you found out it's not actually resolved? [22:47:07] mutante ah woops, my message was wrong [22:47:12] i meant to unmark it as security [22:47:17] oooh [22:47:25] sorry i doin't know why i said open, that was my wrong. 
[22:47:30] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 51.31, 23.11, 15.42 [22:47:43] (03PS1) 10Ayounsi: Ping offload test, reserve IPs [dns] - 10https://gerrit.wikimedia.org/r/420923 (https://phabricator.wikimedia.org/T190090) [22:47:48] open to the public, heh [22:47:53] thanks :) [22:48:30] done [22:48:40] thanks :) [22:48:51] (03CR) 10Ayounsi: [C: 032] Add mr1-eqsin <-> cr1-eqsin IPv6 [dns] - 10https://gerrit.wikimedia.org/r/416882 (owner: 10Ayounsi) [22:49:31] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 24.36, 24.00, 16.82 [22:52:20] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4067011 (10Krinkle) [22:53:16] 10Operations, 10Traffic, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (10Krinkle) (Added a Status column to separate network analysis [Y/N/?] from current status [blocked/planned/done].) 
[22:54:21] !log maxsem@tin Finished scap: Test deployment of GlobalPreferences (duration: 39m 31s) [22:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:09] (03CR) 10MaxSem: [C: 032] Revert "Deploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420921 (owner: 10MaxSem) [22:55:14] !log volans@tin Started deploy [puppetboard/deploy@0975558]: Initial sync [22:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:19] !log volans@tin Finished deploy [puppetboard/deploy@0975558]: Initial sync (duration: 00m 05s) [22:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:24] (03Merged) 10jenkins-bot: Revert "Deploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420921 (owner: 10MaxSem) [22:56:34] hi [22:58:32] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: Revert GlobalPreferences (duration: 01m 17s) [22:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:06] (03CR) 10jenkins-bot: Revert "Deploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420921 (owner: 10MaxSem) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T2300). [23:00:05] Zoranzoki21: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:01:12] !log Cleaned centralauth.global_preferences after testing [23:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:17] * MaxSem is done [23:01:34] MaxSem: Can you deploy my change for swat? [23:02:23] (03CR) 10BBlack: [C: 031] Ping offload test, reserve IPs [dns] - 10https://gerrit.wikimedia.org/r/420923 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [23:02:26] sorry, I have a meeting now:( [23:02:37] (03PS1) 10BBlack: VCL: fixed 20m grace value regardless of health [puppet] - 10https://gerrit.wikimedia.org/r/420927 (https://phabricator.wikimedia.org/T189892) [23:02:39] (03PS1) 10BBlack: VCL: only keep objects worth keep-ing [puppet] - 10https://gerrit.wikimedia.org/r/420928 (https://phabricator.wikimedia.org/T189892) [23:02:40] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:07:05] (03CR) 10Ayounsi: [C: 032] Ping offload test, reserve IPs [dns] - 10https://gerrit.wikimedia.org/r/420923 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [23:07:09] (03PS2) 10Ayounsi: Ping offload test, reserve IPs [dns] - 10https://gerrit.wikimedia.org/r/420923 (https://phabricator.wikimedia.org/T190090) [23:08:05] XioNoX: where are you merging now? [23:08:22] mutante: ? [23:09:20] mutante: today still puppetmaster1001, we'll notify in advance when that will change ;) [and this is a DNS patch] [23:09:48] oh, ok! so i can use it again? [23:10:21] sure [23:10:44] :) 'k [23:13:32] (03CR) 10Paladox: "I've currently have this cherry picked onto puppet-phabricator which is applied to phabricator-stretch4 (and puppet passes on both the jes" [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [23:14:36] What happening with SWAT? [23:14:57] Is all ok with things related to SWAT Deployment? 
[23:17:46] Zoranzoki21: sorry, some days there aren't enough people around to do a swat deployment. What's the patch you need deployed? [23:18:07] patch about throttle rule [23:18:12] jouncebot: now [23:18:12] For the next 0 hour(s) and 41 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180320T2300) [23:18:38] This: https://gerrit.wikimedia.org/r/420807 [23:19:00] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:19:20] RECOVERY - Check systemd state on puppetboard1001 is OK: OK - running: The system is fully operational [23:19:58] can swat be postponed while working out an issue with puppetdb? some ssh host keys are out of sync [23:20:26] A /16 is huuuge [23:20:37] herron: Yeah, nothing looks urgent [23:20:43] what he said ^ [23:20:44] What I should do now? [23:20:52] I will reschedule [23:20:59] ok thank you [23:21:08] mutante: do I have to specify something special in dhcp to have a "regular" ganeti vm? Like a option pxelinux.pathprefix etc ? [23:21:44] XioNoX: no, it's just like a physical machine for DHCP [23:21:55] XioNoX: https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM [23:22:06] partman recipe you probably want the line with "flat.cfg" [23:22:23] volans: but it says "Same as usual." and I don't know what is usual [23:22:23] Will problem with puppet be fixed until next swat in this time because I can reschedule for only evening swat? [23:22:29] like (all?most) other ganeti VMs [23:22:39] Zoranzoki21: I don't think that patch is needed [23:23:27] XioNoX: in netboot.cfg " Do however add virtual.cfg to the configuration for a specific VM." [23:23:38] XioNoX: look gerrit changes 419193, 419247, 419248 [23:23:40] Reedy: It is configuration change, and this have to be deployed in swat deployment.. I only asking now, will problems be fixed until next evening swat? 
[23:24:17] thanks [23:24:49] yw :) [23:27:43] (03PS1) 10Ayounsi: Add DHCP and partman for ping1001 [puppet] - 10https://gerrit.wikimedia.org/r/420933 (https://phabricator.wikimedia.org/T190090) [23:28:00] volans: https://gerrit.wikimedia.org/r/#/c/420933/ [23:29:05] XioNoX: here's more example, https://phabricator.wikimedia.org/T188163 [23:29:23] (03CR) 10Dzahn: [C: 031] Add DHCP and partman for ping1001 [puppet] - 10https://gerrit.wikimedia.org/r/420933 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [23:29:28] mutante: https://gerrit.wikimedia.org/r/#/c/420933/ :) [23:30:47] lgtm, will get you stretch [23:31:19] just, you're supposed to use the template for "hardware-requests" for VMs [23:33:26] https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Site:%20(QUANTITY)%20VM%20%request%20for%20SERVICE%5bS%5d&projects=operations,vm-requests&description=Labs%20Project%20Tested%3A%20%3Cproject_name%3E%0ASite%2FLocation%3A%3CEQIAD%7CCODFW%3E%20%0ANumber%20of%20systems%3A%20%3C%23%20of%20VMs%3E%20%0AService%3A%20%3Cservice%20name%3E%0ANetworking%20Requirements%3A%20%3Cinternal% [23:33:32] 7Cexternal%20IP%3E%2C%20%3Cspecific%20networking%20access%20needed%3E%20%0AProcessor%20Requirements%3A%20%3CNumber%20of%20Virtual%20CPUS%3E%0AMemory%3A%20%0ADisks%3A%20%3CCapacity%20only%3E%0AOther%20Requirements%3A%20%0A [23:33:53] fail, this link is from https://wikitech.wikimedia.org/wiki/Operations_requests#Virtual_machine_requests_(Production) [23:34:17] ok, thanks [23:35:01] (03PS1) 10Odder: Update logo for Dutch Low Saxon Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420934 (https://phabricator.wikimedia.org/T190051) [23:35:05] I have to run, will merge when I'm back [23:35:08] last time i just took that template, pasted into https://phabricator.wikimedia.org/T189899 and then i self-created it [23:35:18] 'kk, cya [23:36:23] mornin' guys, is SWAT proceeding as normal? 
[23:36:35] No [23:37:31] //quit [23:37:33] :) [23:39:08] * odder will schedule a patch for a different window then [23:40:26] Probably a better idea :) [23:40:36] Ops are working on a few things [23:41:02] !log Mass no-op resizing of Whisper files on graphite2001 and graphite1001 for T179622 (webpagetest.* namespace) [23:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:08] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [23:42:31] (03CR) 10Reedy: [C: 04-1] "CR-1 pending a reply to https://phabricator.wikimedia.org/T190206#4067118" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420807 (https://phabricator.wikimedia.org/T190206) (owner: 10Zoranzoki21) [23:42:34] (03PS1) 10RobH: add mmiller to admins module [puppet] - 10https://gerrit.wikimedia.org/r/420935 (https://phabricator.wikimedia.org/T190096) [23:43:07] (03CR) 10RobH: [C: 032] add mmiller to admins module [puppet] - 10https://gerrit.wikimedia.org/r/420935 (https://phabricator.wikimedia.org/T190096) (owner: 10RobH) [23:50:31] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:58:49] (03PS1) 10BryanDavis: wiki replicas: Fix stale logging_whitelist reference [puppet] - 10https://gerrit.wikimedia.org/r/420937 [23:59:32] (03CR) 10BryanDavis: "$ sudo maintain-views --table global_preferences --database centralauth --d" [puppet] - 10https://gerrit.wikimedia.org/r/420937 (owner: 10BryanDavis)