[00:03:58] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 re
[00:03:59] ed body (AttributeError: NoneType object has no attribute get)
[00:03:59] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get)
[00:05:09] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[00:05:09] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[00:13:33] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@c81dd9e]: Redeploy Updater for removal of props channel
[00:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:18] PROBLEM - High lag on wdqs1003 is CRITICAL: 6191 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:22:23] (CR) Smalyshev: [C: 1] wdqs: auto restart wdqs-updater on config changes [puppet] - https://gerrit.wikimedia.org/r/467420 (owner: Gehel)
[00:23:54] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@c81dd9e]: Redeploy Updater for removal of props channel (duration: 10m 21s)
[00:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:33] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) @Andrew thanks, I'll start testing on it tomorrow.
[00:44:49] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL:
[00:44:49] ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[00:44:49] -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[00:45:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[00:59:38] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 re
[00:59:38] ed body (AttributeError: NoneType object has no attribute get)
[01:00:48] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[01:18:09] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1147 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:12:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:23:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:37:54] Operations, Patch-For-Review, Performance-Team (Radar): Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (Krinkle)
[03:05:59] (CR) jenkins-bot: Enable reading from new backend of change_tag in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/467315 (https://phabricator.wikimedia.org/T194164) (owner: Ladsgroup)
[03:12:49] (PS1) KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/467556 (https://phabricator.wikimedia.org/T206439)
[03:13:07] (CR) jerkins-bot: [V: -1] hfst: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/467556 (https://phabricator.wikimedia.org/T206439) (owner: KartikMistry)
[03:16:08] (PS2) KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/467556 (https://phabricator.wikimedia.org/T206439)
[03:24:27] (PS1) Milimetric: Remove jobs for decomissioned dashboards [puppet] - https://gerrit.wikimedia.org/r/467558 (https://phabricator.wikimedia.org/T199340)
[03:25:04] (CR) Milimetric: [C: 1] "this can be merged anytime" [puppet] - https://gerrit.wikimedia.org/r/467558 (https://phabricator.wikimedia.org/T199340) (owner: Milimetric)
[03:30:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.26 seconds
[03:32:55] (CR) jerkins-bot: [V: -1] hfst: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/467556 (https://phabricator.wikimedia.org/T206439) (owner: KartikMistry)
[03:51:40] (PS3) KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/467556 (https://phabricator.wikimedia.org/T206439)
[03:52:08] (PS8) Herron: smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785)
[03:52:52] (CR) jerkins-bot: [V: -1] smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: Herron)
[04:02:30] (PS9) Herron: smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785)
[04:02:34] (CR) Herron: smarthost: create mail smarthost role/profile (6 comments) [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: Herron)
[04:03:15] (CR) jerkins-bot: [V: -1] smarthost: create mail smarthost role/profile [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: Herron)
[04:12:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.75 seconds
[05:07:18] (PS1) Marostegui: db-eqiad.php: db1092 remove BBU comments [mediawiki-config] - https://gerrit.wikimedia.org/r/467568
[05:08:29] (CR) Marostegui: [C: 2] db-eqiad.php: db1092 remove BBU comments [mediawiki-config] - https://gerrit.wikimedia.org/r/467568 (owner: Marostegui)
[05:09:35] (Merged) jenkins-bot: db-eqiad.php: db1092 remove BBU comments [mediawiki-config] - https://gerrit.wikimedia.org/r/467568 (owner: Marostegui)
[05:10:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1092 BBU comments after BBU replacement (duration: 00m 52s)
[05:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:48] (CR) jenkins-bot: db-eqiad.php: db1092 remove BBU comments [mediawiki-config] - https://gerrit.wikimedia.org/r/467568 (owner: Marostegui)
[05:22:06] (PS1) Giuseppe Lavagetto: mediawiki: install and start php-fpm on the mwdebug* servers [puppet] - https://gerrit.wikimedia.org/r/467570 (https://phabricator.wikimedia.org/T201140)
[05:22:08] (PS1) Giuseppe Lavagetto: mediawiki: convert to http::site [puppet] - https://gerrit.wikimedia.org/r/467571
[05:22:10] (PS1) Giuseppe Lavagetto: mediawiki: convert more apache defines to httpd [puppet] - https://gerrit.wikimedia.org/r/467572
[05:27:16] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) @Andrew I logged in there and I see this set of disks: ``` Filesyste...
[05:40:39] (Abandoned) KartikMistry: apertium-apy: Set locale to UTF-8 [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[05:48:26] (CR) Elukey: [C: 2] Remove jobs for decomissioned dashboards [puppet] - https://gerrit.wikimedia.org/r/467558 (https://phabricator.wikimedia.org/T199340) (owner: Milimetric)
[05:49:43] (PS1) KartikMistry: cg3: New upstream release [debs/contenttranslation/cg3] - https://gerrit.wikimedia.org/r/467573 (https://phabricator.wikimedia.org/T206439)
[05:52:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:52:58] uh?
[05:54:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:55:01] marostegui: db1061.eqiad.wmnet. ?
[05:55:26] Expectation (masterConns <= 0) by MediaWiki::restInPeace not met (actual: 1):
[05:55:26] yeah, I am checking that
[05:55:29] [connect to 10.64.32.227 (frwiki)]
[05:55:31] it is s6 master
[05:56:12] I cannot see anything wrong with it so far
[06:05:03] !log stopping db1092 and db1087 in sync T206743
[06:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:06] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[06:15:49] (CR) Elukey: [C: 2] "Removed manually the crons from stat1006 since they were still there after the puppet run.." [puppet] - https://gerrit.wikimedia.org/r/467558 (https://phabricator.wikimedia.org/T199340) (owner: Milimetric)
[06:19:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:23:38] <_joe_> we're having 200 fatals per minute
[06:24:04] eh
[06:24:13] <_joe_> InvalidArgumentException from line 2139 of /srv/mediawiki/php-1.32.0-wmf.24/includes/libs/rdbms/database/Database.php: Wikimedia\Rdbms\Database::makeList: empty input for field ct_tag_id
[06:24:25] <_joe_> uhm
[06:24:28] Amir1: ^
[06:24:57] hewiki is s7, and I believe he enabled that on s7
[06:25:28] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467315/
[06:25:35] Operations, Parsoid, Datacenter-Switchover-2018: Parsoid no longer active-active - https://phabricator.wikimedia.org/T207091 (akosiaris) Yes this indeed has happened and it's true for all services. We meant to return to the normal state on Monday (yesterday) but we didn't for unrelated to this reason...
[06:26:09] so....revert?
[06:26:16] legoktm: I would say so
[06:27:17] (CR) Alexandros Kosiaris: "Both will be needed, since, in puppet, we are fully overriding the systemd unit file provided by the package" [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:27:21] (Restored) Alexandros Kosiaris: apertium-apy: Set locale to UTF-8 [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:27:44] (CR) Alexandros Kosiaris: [C: 1] "Do we need to upgrade before merging this?" [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:28:00] (PS1) Legoktm: Revert "Enable reading from new backend of change_tag in s7" [mediawiki-config] - https://gerrit.wikimedia.org/r/467575
[06:28:35] (CR) Legoktm: [C: 2] Revert "Enable reading from new backend of change_tag in s7" [mediawiki-config] - https://gerrit.wikimedia.org/r/467575 (owner: Legoktm)
[06:29:39] (Merged) jenkins-bot: Revert "Enable reading from new backend of change_tag in s7" [mediawiki-config] - https://gerrit.wikimedia.org/r/467575 (owner: Legoktm)
[06:29:39] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:30:30] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:30:31] tested on mwdebug1002, fixes the issue
[06:32:00] (CR) Alexandros Kosiaris: [C: 1] Enable base::service_auto_restart for squid/url downloaders [puppet] - https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:32:09] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Enable reading from new backend of change_tag in s7" (T194164) (duration: 00m 50s)
[06:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:13] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164
[06:32:37] marostegui, _joe_: exceptions have stopped
[06:33:47] legoktm: thanks a lot for the revert
[06:34:04] (PS2) Alexandros Kosiaris: uwsgi: Remove the uwsgi-dbg package [puppet] - https://gerrit.wikimedia.org/r/466723
[06:34:16] (CR) Alexandros Kosiaris: [V: 2 C: 2] uwsgi: Remove the uwsgi-dbg package [puppet] - https://gerrit.wikimedia.org/r/466723 (owner: Alexandros Kosiaris)
[06:34:31] np
[06:35:41] I am going to comment on the ticket
[06:35:53] Ah, you already did <3
[06:38:46] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Marostegui) Incident report (please feel free to add or modify whatever you feel it needs some changes!): https://wikitech.w...
[06:38:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:39:08] (CR) Alexandros Kosiaris: [C: 1] monitoring::service: fix $cluster FIXME [puppet] - https://gerrit.wikimedia.org/r/459660 (owner: Dzahn)
[06:39:18] Operations, DBA, MediaWiki-Database, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover
[06:40:03] (PS2) Mathew.onipe: scap::target: added additional_services_names param [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314)
[06:44:32] Operations, ops-eqiad, Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (elukey) p:Triage>High
[06:45:24] Operations, ops-eqiad, Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (elukey) @Cmjohnson this server is OOW but the replacement will take time to arrive (still in procurement..) and this host is really important for the research users. Do we have a spare disk tha...
[06:45:52] (CR) jenkins-bot: Revert "Enable reading from new backend of change_tag in s7" [mediawiki-config] - https://gerrit.wikimedia.org/r/467575 (owner: Legoktm)
[06:47:43] (CR) Alexandros Kosiaris: [C: 1] base/icinga: use monitoring_hosts constant as NRPE allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[06:48:25] (CR) KartikMistry: "> Do we need to upgrade before merging this?" [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:49:42] (CR) Alexandros Kosiaris: [C: 2] apertium-apy: Set locale to UTF-8 [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:49:49] (PS3) Alexandros Kosiaris: apertium-apy: Set locale to UTF-8 [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:49:52] (CR) Alexandros Kosiaris: [V: 2 C: 2] apertium-apy: Set locale to UTF-8 [puppet] - https://gerrit.wikimedia.org/r/464482 (owner: KartikMistry)
[06:52:48] marostegui: that's ms, make a ticket and I fix it
[06:53:02] Amir1: lego commented on the ticket already
[06:53:03] (CR) Muehlenhoff: mediawiki::web::prod_sites: convert wikidata.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:53:59] kart_: ok merged and deployed
[06:55:10] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:55:31] (CR) Alexandros Kosiaris: [C: 2] apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/463745 (https://phabricator.wikimedia.org/T199447) (owner: KartikMistry)
[06:55:40] Thanks
[06:56:00] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:59:39] (PS3) Muehlenhoff: Enable base::service_auto_restart for squid/url downloaders [puppet] - https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991)
[07:00:48] PROBLEM - Check size of conntrack table on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:00:59] PROBLEM - configured eth on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:08] PROBLEM - Check whether ferm is active by checking the default input chain on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:28] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:38] PROBLEM - MD RAID on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:39] PROBLEM - Disk space on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:48] PROBLEM - Check systemd state on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:48] PROBLEM - dhclient process on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:01:58] PROBLEM - DPKG on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:02:30] marostegui: cumin cumin? --^
[07:03:33] [Tue Oct 16 05:50:51 2018] Out of memory: Kill process 54397 (mysql) score 469 or sacrifice child
[07:04:38] PROBLEM - puppet last run on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[07:04:49] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys.
[07:04:59] RECOVERY - MD RAID on cumin1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:04:59] RECOVERY - Disk space on cumin1001 is OK: DISK OK
[07:05:08] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational
[07:05:09] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient
[07:05:18] RECOVERY - DPKG on cumin1001 is OK: All packages OK
[07:05:19] RECOVERY - Check size of conntrack table on cumin1001 is OK: OK: nf_conntrack is 0 % full
[07:05:29] RECOVERY - configured eth on cumin1001 is OK: OK - interfaces up
[07:05:39] RECOVERY - Check whether ferm is active by checking the default input chain on cumin1001 is OK: OK ferm input default policy is set
[07:06:22] jynus: were you using mysql (client) on cumin1001 by any chance? --^
[07:06:28] (trying to understand what happened)
[07:06:48] (CR) Mathew.onipe: scap::target: added additional_services_names param (5 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[07:09:39] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:09:59] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for squid/url downloaders [puppet] - https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:12:23] akosiaris: thanks. Feel free to upgrade -apy package next.
[07:12:41] oh, that too :)
[07:12:51] elukey: I wasn't
[07:13:25] marostegui: no no for you it was only the usual "cumin cumin", I saw Jaime in 'last' and pinged him :)
[07:13:40] elukey: cumin days
[07:13:48] precisely
[07:14:10] * elukey sends wikilove to volans
[07:17:07] !log installing net-snmp security updates
[07:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:32] !log akosiaris@deploy1001 scap-helm mathoid upgrade [namespace: mathoid, clusters: codfw]
[07:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:49] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid [namespace: mathoid, clusters: codfw]
[07:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:54] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed
[07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:01] !log akosiaris@deploy1001 scap-helm mathoid finished
[07:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:45] (CR) Marostegui: "I made a few comments, the debian changelog, probably moritzm can comment further." (2 comments) [debs/wmf-pt-kill] - https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: Banyek)
[07:21:31] Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (ayounsi) What should be the runbook/actions when this alert goes off?
[07:22:23] !log upload apertium-apy_0.11.4-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main
[07:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:43] !log upload apertium-apy_0.11.4-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main T199447
[07:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:46] T199447: apertium-apy doesn't honor -P / --log-path - https://phabricator.wikimedia.org/T199447
[07:24:01] (PS1) Muehlenhoff: Add library hints for net-snmp [puppet] - https://gerrit.wikimedia.org/r/467631
[07:25:39] (CR) Muehlenhoff: [C: 2] Add library hints for net-snmp [puppet] - https://gerrit.wikimedia.org/r/467631 (owner: Muehlenhoff)
[07:46:43] !log upgrade apertium-apy throught the fleet
[07:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:57] !log upgrade apertium-apy throught the fleet T199447
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:01] T199447: apertium-apy doesn't honor -P / --log-path - https://phabricator.wikimedia.org/T199447
[07:51:10] (PS2) Gehel: wdqs: auto restart wdqs-updater on config changes [puppet] - https://gerrit.wikimedia.org/r/467420
[07:53:56] (CR) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer (3 comments) [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: Mathew.onipe)
[07:54:00] (CR) Gehel: [C: 2] wdqs: auto restart wdqs-updater on config changes [puppet] - https://gerrit.wikimedia.org/r/467420 (owner: Gehel)
[07:54:25] kart_: upgrade complete
[07:54:57] (PS4) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[07:55:43] (CR) jerkins-bot: [V: -1] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: Mathew.onipe)
[07:57:44] akosiaris: cool. Thanks a lot!
[07:57:46] (PS1) KartikMistry: hfst-ospell: New upstream release [debs/contenttranslation/hfst-ospell] - https://gerrit.wikimedia.org/r/467637 (https://phabricator.wikimedia.org/T206439)
[07:58:53] (PS5) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[07:59:28] (CR) jerkins-bot: [V: -1] hfst-ospell: New upstream release [debs/contenttranslation/hfst-ospell] - https://gerrit.wikimedia.org/r/467637 (https://phabricator.wikimedia.org/T206439) (owner: KartikMistry)
[08:00:47] (CR) Alexandros Kosiaris: [C: 2] mathoid: Add nominal resource requests [deployment-charts] - https://gerrit.wikimedia.org/r/464483 (owner: Alexandros Kosiaris)
[08:00:49] (CR) Alexandros Kosiaris: [V: 2 C: 2] mathoid: Add nominal resource requests [deployment-charts] - https://gerrit.wikimedia.org/r/464483 (owner: Alexandros Kosiaris)
[08:01:05] (CR) Alexandros Kosiaris: [V: 2 C: 2] mathoid: Switch liveness probe into tcpSocket [deployment-charts] - https://gerrit.wikimedia.org/r/464504 (owner: Alexandros Kosiaris)
[08:01:12] (CR) Alexandros Kosiaris: [V: 2 C: 2] Set the scaffolding's livenessProbe to tcpSocket [deployment-charts] - https://gerrit.wikimedia.org/r/464505 (owner: Alexandros Kosiaris)
[08:01:24] (CR) Alexandros Kosiaris: [V: 2 C: 2] scaffold: Add some sample requests [deployment-charts] - https://gerrit.wikimedia.org/r/464578 (owner: Alexandros Kosiaris)
[08:01:43] (PS2) KartikMistry: hfst-ospell: New upstream release [debs/contenttranslation/hfst-ospell] - https://gerrit.wikimedia.org/r/467637 (https://phabricator.wikimedia.org/T206439)
[08:01:43] (CR) Alexandros Kosiaris: [V: 2 C: 2] mathoid: Bump num_workers to 1 [deployment-charts] - https://gerrit.wikimedia.org/r/464579 (owner: Alexandros Kosiaris)
[08:01:45] (CR) Alexandros Kosiaris: [V: 2 C: 2] mathoid: Bump chart version to 0.0.12 [deployment-charts] - https://gerrit.wikimedia.org/r/464580 (owner: Alexandros Kosiaris)
[08:02:27] Operations, Discovery-Search (Current work), Epic, Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (Mathew.onipe)
[08:03:01] Operations, netops: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (ayounsi) Steps for the upgrade: [] Verify image checksum and validate `request system software validate /var/tmp/junos-vmhost-install-mx-x86-64-17.4R2.4.tgz` [] Start upgrade process `request system soft...
[08:04:12] (CR) Filippo Giunchedi: [C: 1] "This is great, thanks!" [puppet] - https://gerrit.wikimedia.org/r/467337 (owner: Gehel)
[08:05:32] (CR) Gehel: [C: -1] "There are a few problems, see comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:05:47] (PS2) Gehel: monitoring::check_prometheus: fix compiler warnings, adding type constraints [puppet] - https://gerrit.wikimedia.org/r/467337
[08:07:24] (PS1) Elukey: profile::memcached::instance: set the -R arg via hiera [puppet] - https://gerrit.wikimedia.org/r/467639 (https://phabricator.wikimedia.org/T203786)
[08:07:26] (CR) Gehel: [C: 2] monitoring::check_prometheus: fix compiler warnings, adding type constraints [puppet] - https://gerrit.wikimedia.org/r/467337 (owner: Gehel)
[08:07:28] (PS1) Elukey: Apply -R 200 to mc1035's memcached instance as perf test [puppet] - https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786)
[08:13:32] (PS2) Elukey: profile::memcached::instance: set the -R arg via hiera [puppet] - https://gerrit.wikimedia.org/r/467639 (https://phabricator.wikimedia.org/T203786)
[08:13:34] (PS2) Elukey: Apply -R 200 to mc1035's memcached instance as perf test [puppet] - https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786)
[08:14:58] (PS2) Gehel: tilerator: removed rspec tests [puppet] - https://gerrit.wikimedia.org/r/467319 (https://phabricator.wikimedia.org/T204240) (owner: Mathew.onipe)
[08:15:52] (CR) Gehel: [C: 2] tilerator: removed rspec tests [puppet] - https://gerrit.wikimedia.org/r/467319 (https://phabricator.wikimedia.org/T204240) (owner: Mathew.onipe)
[08:17:33] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12935/" [puppet] - https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[08:20:20] (CR) Faidon Liambotis: [C: -1] "Left a few more comments, sorry doing this in multiple passes!" (10 comments) [puppet] - https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: Herron)
[08:22:59] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:18] PROBLEM - Disk space on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:18] PROBLEM - MD RAID on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:19] PROBLEM - puppet last run on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:19] PROBLEM - Check systemd state on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:19] PROBLEM - dhclient process on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:29] PROBLEM - DPKG on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:29] PROBLEM - Check size of conntrack table on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:38] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 1.2e+05 ge 3600 Gehel testing in progress https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:23:48] PROBLEM - configured eth on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:23:49] PROBLEM - Check whether ferm is active by checking the default input chain on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:24:17] [Tue Oct 16 07:11:15 2018] Out of memory: Kill process 213890 (mysql) score 359 or sacrifice child
[08:25:55] (PS2) Giuseppe Lavagetto: mediawiki: convert to http::site [puppet] - https://gerrit.wikimedia.org/r/467571
[08:25:57] (PS1) Giuseppe Lavagetto: mediawiki::syslog: stop looking variables up the scope [puppet] - https://gerrit.wikimedia.org/r/467641
[08:25:59] (PS1) Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - https://gerrit.wikimedia.org/r/467642
[08:26:01] (PS1) Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/467643
[08:26:03] (PS1) Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - https://gerrit.wikimedia.org/r/467644
[08:26:48] PROBLEM - Check the NTP synchronisation status of timesyncd on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[08:26:51] (PS3) Mathew.onipe: scap::target: added additional_services_names param [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314)
[08:27:34] elukey: was your comment about cumin1001?
[08:27:36] (CR) jerkins-bot: [V: -1] mediawiki: add httpd class, alternative to mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/467643 (owner: Giuseppe Lavagetto)
[08:27:52] (CR) Mathew.onipe: scap::target: added additional_services_names param (2 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:28:14] (PS5) Filippo Giunchedi: New class: prometheus::statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870)
[08:28:16] (PS4) Filippo Giunchedi: thumbor: add prometheus-statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870)
[08:28:30] (CR) Filippo Giunchedi: New class: prometheus::statsd_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[08:28:51] gehel: yeah, I think that Jaime has some background tasks ongoing
[08:30:07] (CR) Gehel: [C: -1] scap::target: added additional_services_names param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:30:55] (PS4) Mathew.onipe: scap::target: added additional_services_names param [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314)
[08:32:05] (CR) Giuseppe Lavagetto: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/12937/ seems like a noop." [puppet] - https://gerrit.wikimedia.org/r/467641 (owner: Giuseppe Lavagetto)
[08:32:35] (CR) Mathew.onipe: "> Patch Set 2: Code-Review-1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[08:34:22] (PS1) Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542)
[08:35:09] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys.
[08:35:28] RECOVERY - Disk space on cumin1001 is OK: DISK OK
[08:35:28] RECOVERY - MD RAID on cumin1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[08:35:29] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational
[08:35:29] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient
[08:35:39] RECOVERY - DPKG on cumin1001 is OK: All packages OK
[08:35:39] RECOVERY - Check size of conntrack table on cumin1001 is OK: OK: nf_conntrack is 0 % full
[08:35:58] RECOVERY - configured eth on cumin1001 is OK: OK - interfaces up
[08:35:59] RECOVERY - Check whether ferm is active by checking the default input chain on cumin1001 is OK: OK ferm input default policy is set
[08:38:35] (CR) Muehlenhoff: [C: 1] "One nit, looks good to me!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[08:38:38] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:38:59] (PS2) Giuseppe Lavagetto: mediawiki::syslog: stop looking variables up the scope [puppet] - https://gerrit.wikimedia.org/r/467641
[08:39:01] (PS3) Giuseppe Lavagetto: mediawiki: convert to http::site [puppet] - https://gerrit.wikimedia.org/r/467571
[08:39:03] (PS2) Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/467643
[08:39:05] (PS2) Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - https://gerrit.wikimedia.org/r/467644
[08:39:07] (PS2) Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - https://gerrit.wikimedia.org/r/467642
[08:40:17] (CR) jerkins-bot: [V: -1] mediawiki: add httpd class, alternative to mediawiki::web [puppet] - https://gerrit.wikimedia.org/r/467643 (owner: Giuseppe Lavagetto)
[08:40:23] (PS2) Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542)
[08:40:25] (PS1) Elukey: role::eventlogging::analytics::files: lower down retention [puppet] - https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542)
[08:42:03] !log removed mwmaint1001 from debmonitor (T192457)
[08:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:06] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457
[08:48:42] (PS3) Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - https://gerrit.wikimedia.org/r/467646
(https://phabricator.wikimedia.org/T206542) [08:48:44] (03PS2) 10Elukey: role::eventlogging::analytics::files: lower down retention [puppet] - 10https://gerrit.wikimedia.org/r/467648 (https://phabricator.wikimedia.org/T206542) [08:48:58] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:50:28] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 50.93 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:52:18] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 152 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:52:18] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 12, down: 0, shutdown: 6 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:38] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 77.34 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:54:32] !stopping pc2004 -> pc1004 replication (T206740) [08:54:32] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [08:54:55] banyek: missed the !log ;) [08:56:38] !log stopping pc2004 -> pc1004 replication (T206740) [08:56:40] tx [08:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:59] RECOVERY - Check the NTP synchronisation status of timesyncd on cumin1001 is OK: OK: synced at Tue 2018-10-16 08:56:51 UTC. 
[08:59:18] (03PS7) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [09:00:30] (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [09:02:29] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [09:03:04] !log rolling reboot of thumbor in codfw for kernel security updates [09:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:04] (03CR) 10Giuseppe Lavagetto: [C: 031] profile::memcached::instance: set the -R arg via hiera [puppet] - 10https://gerrit.wikimedia.org/r/467639 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [09:06:48] (03CR) 10Giuseppe Lavagetto: [C: 031] Apply -R 200 to mc1035's memcached instance as perf test [puppet] - 10https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [09:07:21] (03PS3) 10Elukey: profile::memcached::instance: set the -R arg via hiera [puppet] - 10https://gerrit.wikimedia.org/r/467639 (https://phabricator.wikimedia.org/T203786) [09:08:14] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1001/12941/ seems ok indeed." 
[puppet] - 10https://gerrit.wikimedia.org/r/467571 (owner: 10Giuseppe Lavagetto) [09:08:23] (03CR) 10Gilles: [C: 031] thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:11:55] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10faidon) p:05Triage>03Normal [09:12:01] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12945/ - no op as expected but I'll merge with puppet disabled JUST IN CASE :)" [puppet] - 10https://gerrit.wikimedia.org/r/467639 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [09:15:07] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10faidon) p:05Triage>03High [09:16:30] gehel: shall I merge yours too? [09:16:54] elukey: oops, yes, please! [09:17:00] super thanks [09:17:51] (03PS1) 10Gehel: prometheus: fix deprecation warnings in prometheus::node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/467654 [09:18:36] (03PS6) 10Filippo Giunchedi: New class: prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [09:18:38] (03PS5) 10Filippo Giunchedi: thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [09:19:29] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10faidon) This refers to power connections specifically as it's a subtask of the power incident, but that spreadsheet covers patches as well, and we should probably document these as well. Also, I'... 
[09:20:54] (03CR) 10Gehel: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [09:23:42] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) On pc1004 the slave status before 'reset slave all': ``` Slave_IO_State: Master_Host: pc200... [09:32:43] I'm testing some stuff in mediawiki.org you might see some fatals for that [09:34:39] (03CR) 10Mobrovac: [C: 04-1] scap::target: added additional_services_names param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [09:35:02] _joe_: https://phabricator.wikimedia.org/T194164#4669720 [09:35:02] hey [09:38:07] <_joe_> Amir1: oh nice. I agree with you about the fatals :) [09:39:29] (03PS2) 10Elukey: role::prometheus::ops: collect memcached stats from thumbor/swift [puppet] - 10https://gerrit.wikimedia.org/r/466828 [09:39:48] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) Regarding the information_schema there will be not too much data to reclaim: ``` MariaDB [(none)]> select SUM(DATA_LENGTH)/1024/... [09:41:28] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Pick some of the biggest tables and try to alter them to see how much you get and then we can probably extrapolate. If it i... [09:42:55] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) for luck all the tables around 6,5 Gb there, so yeah, I guess picking one is good enough. 
I do it now, then provide the data here [09:43:28] (03CR) 10Filippo Giunchedi: [C: 031] role::prometheus::ops: collect memcached stats from thumbor/swift [puppet] - 10https://gerrit.wikimedia.org/r/466828 (owner: 10Elukey) [09:43:51] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4669725, @Banyek wrote: > for luck all the tables around 6,5 Gb there, so yeah, I guess picking one is good e... [09:44:20] (03CR) 10Elukey: [C: 032] role::prometheus::ops: collect memcached stats from thumbor/swift [puppet] - 10https://gerrit.wikimedia.org/r/466828 (owner: 10Elukey) [09:46:51] (03PS3) 10Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/467643 [09:46:53] (03PS3) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [09:46:55] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::common: stop using class 'mediawiki' [puppet] - 10https://gerrit.wikimedia.org/r/467642 [09:47:26] (03PS6) 10Filippo Giunchedi: thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [09:47:32] (03PS7) 10Filippo Giunchedi: New class: prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [09:49:18] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10faidon) I think this is now done with Neutron, and while the old space remains for now, the migration is underway, so this task can be closed. @ayounsi, @aborrero, @ch... 
[09:50:49] (03CR) 10Filippo Giunchedi: [C: 032] New class: prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:51:02] (03PS7) 10Filippo Giunchedi: thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [09:51:20] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:53:29] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) the result is 5.9 Gb, which is more than expected, but still not too much - less than 10% - with the whole operation we can free... [09:53:49] !log upload blubber_0.6.0-1_amd64 to apt.wikimedia.org/jessie-wikimedia/main and apt.wikimedia.org/stretch-wikimedia/main T206766 [09:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:53] T206766: Update Debian package of Blubber (0.6.0-1) - https://phabricator.wikimedia.org/T206766 [09:54:06] 10Operations, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.6.0-1) - https://phabricator.wikimedia.org/T206766 (10akosiaris) 05Open>03Resolved a:03akosiaris Package built and uploaded to both jessie-wikimedia and stretch-wikimedia [09:56:41] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: 10Banyek) [09:57:21] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) Considering we are at around 70% at eqiad, and 
should remain like that, I don't think it is worth the hassle and the risk of... [09:59:17] (03PS1) 10Filippo Giunchedi: statsd_exporter: fix commandline flags [puppet] - 10https://gerrit.wikimedia.org/r/467659 (https://phabricator.wikimedia.org/T205870) [09:59:19] (03CR) 10Banyek: wmf-pt-kill: logrotate feature added (032 comments) [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: 10Banyek) [09:59:19] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:59:26] that's me ^ [09:59:42] (03CR) 10Filippo Giunchedi: [C: 032] statsd_exporter: fix commandline flags [puppet] - 10https://gerrit.wikimedia.org/r/467659 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [10:02:39] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational [10:06:40] (03PS2) 10Banyek: wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) [10:07:39] PROBLEM - puppet last run on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:07:39] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:07:49] PROBLEM - Disk space on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:07:59] PROBLEM - MD RAID on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:07:59] PROBLEM - Check systemd state on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:07:59] PROBLEM - dhclient process on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:08:01] wut? moritzm ^^^ are you doing anything? 
[10:08:09] PROBLEM - Check size of conntrack table on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:08:09] PROBLEM - DPKG on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:08:15] volans: it has been happening all morning, it is mysql [10:08:23] (client) [10:08:27] diamon and nrpe down [10:08:28] PROBLEM - configured eth on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:08:29] PROBLEM - Check whether ferm is active by checking the default input chain on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:08:38] elukey: what do you mean mysql client? [10:08:48] (03PS2) 10Muehlenhoff: Switch yhsm_aead_sync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/466867 [10:08:53] fork() failed with error 12, bailing out... [10:09:16] volans: if you check dmesg the oom killer should have done its work, the past two times were (mysql) [10:09:17] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) ok. I'll truncate the tables then in pc2004 today, 2005 tomorrow and 2006 the day after tomorrow. I also stop the binlog purgers... [10:09:29] (I mean this morning, didn't check now) [10:09:44] sure, but why is going OOM? [10:09:45] (03CR) 10Muehlenhoff: [C: 032] Switch yhsm_aead_sync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/466867 (owner: 10Muehlenhoff) [10:10:07] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) You can probably disconnect pc1005 and pc1006 today already, but keep executing the truncating without binlog, just in case. [10:10:07] /usr/local/sbin/mysql.py using 61GB of RAM [10:10:11] WUT>!>!>!>! 
[10:10:15] volans: I have no idea :D There seems to be some root processes doing stuff that I have no idea about [10:10:35] marostegui, banyek : do you know anything about it? ^^^ [10:10:35] I pinged Jaime because last showed him, and Manuel didn't start anything [10:10:50] "mysql.py --host=db1092.eqiad.wmnet wikidatawiki -e SELECT * FROM" [10:11:00] I am running those [10:11:05] I am fixing wikidatawiki [10:11:13] the OOM killer is killing it though [10:11:14] I need to select sometimes gb of data into memory [10:11:18] as it's using all the memory [10:11:35] (03PS1) 10Elukey: Revert "role::prometheus::ops: collect memcached stats from thumbor/swift" [puppet] - 10https://gerrit.wikimedia.org/r/467660 [10:11:37] wow [10:11:39] yes, it is what it takes to recover 2TB of memory with 64GB [10:11:53] (03PS2) 10Elukey: Revert "role::prometheus::ops: collect memcached stats from thumbor/swift" [puppet] - 10https://gerrit.wikimedia.org/r/467660 [10:12:40] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10mobrovac) >>! In T205452#4615789, @jcrespo wrote: > For the firewall, I need to know the source (mysql client) ips. That's th... [10:12:41] can you do this in a way that doesn't trigger the OOM killer and kills all kinds of random daemons? 
[10:12:44] but I guess each time the OOM killer kills it you have to restart [10:12:46] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) [10:12:49] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Doing), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10mobrovac) a:03mobrovac [10:12:54] volans: no it is done manually [10:12:58] also, I see nothing in SAL [10:13:05] (03CR) 10Elukey: [C: 032] Revert "role::prometheus::ops: collect memcached stats from thumbor/swift" [puppet] - 10https://gerrit.wikimedia.org/r/467660 (owner: 10Elukey) [10:13:06] I started this morning [10:13:46] I can stop doing it [10:13:59] paravoid: we normally don't log actions like that (like comparing tables or running big queries) [10:14:40] well ok, but this sounds like it wasn't normal? [10:14:51] wasn't expected to be normal, I mean [10:14:54] I'm wondering if we could use any big fat host with 512GB of RAM to make it easier and quicker, seems a painful work to do at 64GB at a time [10:15:11] volans jynus maybe db1118? [10:15:16] this is causing issues on cumin1001 which is a shared resource across the team, so the team needs to be aware :) [10:15:21] those are the only hosts I hav eacces to all myswl servers [10:15:34] what if someone else wanted to run something else that also wanted to use 60GB of RAM? 
:) [10:15:41] if I use db1118 I it will delay me by 2 days [10:15:42] paravoid: Yeah, we are trying to fix wikidatawiki which unfortunately has some massive tables [10:15:48] and we are in a data loss situation [10:15:55] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) ok, good idea [10:16:22] basically, in an outage were wikidatawiki can stop working at any time due to replication breaking [10:16:23] (03PS1) 10Mobrovac: Recommendation API: Add MySQL connection config [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) [10:16:44] jynus: do you want me to grant all s8 grants from db1118 maybe temporarily? [10:16:45] I understand, and I'm grateful you guys are working on it, but that's even more reason to overcommunicate [10:16:58] I logged it [10:17:22] I am not looking at the IRC becaue it requires my full concentration [10:18:51] db1118 is a test host that shouldn't contain the root password for the whole fleet [10:19:16] Not the whole fleet, just s8 or the hosts you need from s8, which I believe db1092 and db1087 for now? [10:19:21] FOr the comparison [10:19:52] I need to compare other hosts too [10:20:19] I don't think it's a problem per se to continue using cumin1001, just with a few adjustments maybe? 
[10:20:44] if possible to not trigger the OOM killer (so that random daemons and/or others' jobs aren't being killed off) [10:21:03] I cannog guarantee it, obviously I don't do it on purpose [10:21:19] (03PS1) 10Muehlenhoff: Fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/467662 [10:21:22] it is a completly manual process, row by row [10:21:43] and maybe !log something like "running database maintenance tasks on cumin1001, expect very high memory usage" or something like that [10:21:48] <_joe_backup_> I'll just use cumin2001 for today [10:22:01] !log running database maintenance tasks on cumin1001, expect very high memory usage [10:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:07] lol [10:22:13] what? [10:22:15] and we still have (not for long) sarin/neodymium too fwiw [10:22:23] but those have only 32G RAM [10:22:30] moritzm: for us [10:23:08] ah, ok. but we can simply use cumin2001 instead? [10:23:28] sure, because of the extra roundtrips, my queries will go 10-100x slower [10:23:52] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/12951/" [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) (owner: 10Mobrovac) [10:23:54] I think moritzm etc. 
was asking if the rest of the team can use cumin2001 while you're using cumin1001 :) [10:23:59] yep [10:24:24] I will be using neodymium, so this host doesn't feel along, we have spent so many hours togethers that I feel bad for it [10:24:32] *alone [10:24:37] cumin2001 covers all we need, in fact I'm using it for everything all the time already [10:24:44] it's ok guys, whatever you need to deal with this [10:24:51] the rest of the team is here to help, not be in the way [10:24:59] I appreciate that [10:25:05] paravoid: <3 [10:25:09] and I need this is disrupting, but I really need to get this done [10:25:16] know [10:25:52] just to clarify, I was just trying to give you more memory, not to free cumin1001 [10:25:53] it's ok, my only ask is to just be verbose so that we find ways around stuff and don't get secondary effects and get deeper in this hole :) [10:26:52] !stopping pc2005 -> pc1005 replication (T206740) [10:26:52] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [10:27:40] (03CR) 10Marostegui: [C: 031] wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) (owner: 10Banyek) [10:29:19] PROBLEM - Check the NTP synchronisation status of timesyncd on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:29:36] (03CR) 10Muehlenhoff: [C: 032] Fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/467662 (owner: 10Muehlenhoff) [10:31:46] (03CR) 10Vgutierrez: "pcc results for PS64 looks promising: https://puppet-compiler.wmflabs.org/compiler1002/12952/" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [10:33:02] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) pc1005 replication before reset slave all: ``` Master_Host: 
pc2005.codfw.wmnet Master_Log_Fi... [10:33:59] PROBLEM - IPMI Sensor Status on cumin1001 is CRITICAL: Return code of 255 is out of bounds [10:34:05] 10Operations, 10SRE-Access-Requests: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10Vgutierrez) p:05Triage>03Normal [10:35:38] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys. [10:35:48] RECOVERY - Disk space on cumin1001 is OK: DISK OK [10:35:49] RECOVERY - MD RAID on cumin1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [10:35:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational [10:35:58] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient [10:36:08] RECOVERY - Check size of conntrack table on cumin1001 is OK: OK: nf_conntrack is 0 % full [10:36:08] RECOVERY - DPKG on cumin1001 is OK: All packages OK [10:36:18] RECOVERY - configured eth on cumin1001 is OK: OK - interfaces up [10:36:19] RECOVERY - Check whether ferm is active by checking the default input chain on cumin1001 is OK: OK ferm input default policy is set [10:36:24] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) (owner: 10Mobrovac) [10:36:54] !stopping pc2006 -> pc1006 replication (T206740) [10:36:54] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [10:37:03] banyek: !log [10:37:11] !log stopping pc2006 -> pc1006 replication (T206740) [10:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:21] !log stopping pc2005 -> pc1005 replication (T206740) [10:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:58] _joe_backup_: [10:37:58] PHP 7.3 now supported in http://DEB.SURY.ORG PPA [10:38:05] says twitter [10:38:19] RECOVERY - 
puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:41:34] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) pc1006 replication before reset slave all: ``` Master_Host: pc2006.codfw.wmnet Master_Log_File: pc2006-bin.154452 Exec_Master_Lo... [10:41:45] <_joe_> paravoid: now we just have to wait for MediaWiki to support it :P [10:44:32] (03PS2) 10Filippo Giunchedi: Recommendation API: Add MySQL connection config [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) (owner: 10Mobrovac) [10:45:51] jouncebot, next [10:45:51] In 0 hour(s) and 14 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1100) [10:46:02] (03CR) 10Filippo Giunchedi: [C: 032] Recommendation API: Add MySQL connection config [puppet] - 10https://gerrit.wikimedia.org/r/467661 (https://phabricator.wikimedia.org/T205452) (owner: 10Mobrovac) [10:47:05] I gave cumin1001 12h of downtime and also set a temporary MOTD [10:48:51] <_joe_> marostegui: that should help [10:48:58] thanks marostegui! [10:50:26] !log run puppet on scb to deploy db configuration for recommendation-service [10:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:52] mobrovac: ^ [10:51:08] perfect, thnx godog [10:51:57] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) I'll start truncating the tables on pc2004 with: ```for TABLE in $(mysql --skip-ssl -BN -e "show tables" parsercache); do my... 
[10:52:02] !log rolling reboot of thumbor in eqiad for kernel security updates [10:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:33] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [10:54:55] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [10:54:56] (03CR) 10Zfilipin: "The patch has a rebase conflict." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [10:55:19] (03PS4) 10Zfilipin: Enable Translate on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292) (owner: 10Urbanecm) [10:57:23] (03PS4) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) [10:58:49] (03CR) 10Zfilipin: [C: 031] Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 (https://phabricator.wikimedia.org/T207015) (owner: 10Urbanecm) [10:59:24] (03CR) 10Zfilipin: [C: 031] Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) (owner: 10Urbanecm) [10:59:29] RECOVERY - Check the NTP synchronisation status of timesyncd on cumin1001 is OK: OK: synced at Tue 2018-10-16 10:59:27 UTC. [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1100). 
[11:00:04] Jonas_WMDE and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] here [11:00:11] o/ [11:00:14] I can SWAT today [11:00:41] Jonas_WMDE: around for SWAT? [11:00:55] zeljkof: he is not [11:01:01] is it the patch I just rebased? [11:01:13] addshore: yes, do you want to deploy it? [11:01:18] if so, you can ignore that one for this swta sesison! [11:01:24] *swat session [11:01:26] ah, ok, ignoring then :D [11:01:49] Urbanecm: please stand by, I'll ping you when I deploy all throttle rules :D [11:02:11] !log truncating tables in parsecache@pc2004 (T206740) [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:14] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [11:02:26] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 (https://phabricator.wikimedia.org/T207015) (owner: 10Urbanecm) [11:02:36] zeljkof, if you'll have time, I have bunch of non-throttle patches, but throttle patches are a little bit more urgent :D [11:02:39] PROBLEM - Check systemd state on db1107 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:09] Urbanecm: sure, I'll deploy all throttle ones, then we'll see how much can fit in [11:03:13] ok [11:03:18] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:03:28] feel free to add a few more to the calendar [11:03:30] (03Merged) 10jenkins-bot: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 (https://phabricator.wikimedia.org/T207015) (owner: 10Urbanecm) [11:03:34] and I'll deploy as much as possible [11:03:43] Ok, will do. 
I didn't add it before, because the SWAT was full policy-side [11:03:49] jynus: marostegui: are you doing stuff on db1107 and db1108 ? [11:04:09] RECOVERY - IPMI Sensor Status on cumin1001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [11:04:14] Urbanecm: it's ok to add too much, but we can deploy 6 usually [11:04:18] banyek: nope, those are eventlogging, elukey ^ what is what has failed? [11:04:21] ACKNOWLEDGEMENT - Check systemd state on db1107 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Banyek ack [11:04:21] ACKNOWLEDGEMENT - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Banyek ack [11:04:35] banyek: what is what failed? [11:05:10] <_joe_> eventlogging_db_sanitization.service [11:05:26] elukey: ^ [11:05:36] zeljkof, ok, added two of them, I'd appreciate some CR+1es on the others and also we'll probably not have that amount of time :) [11:05:40] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:467316|Remove expired throttle rule (T207015)]] (duration: 00m 50s) [11:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:43] T207015: Remove expired throttle rule - https://phabricator.wikimedia.org/T207015 [11:05:58] Urbanecm: first one deployed [11:06:08] ack [11:06:17] (03PS3) 10Zfilipin: Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) (owner: 10Urbanecm) [11:06:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) (owner: 10Urbanecm) [11:07:46] (03Merged) 10jenkins-bot: Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) (owner: 10Urbanecm) [11:09:17] !log
zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:467321|Add throttle rule for "Night of the Digital Language" (T206408)]] (duration: 00m 49s) [11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:21] T206408: Requesting temporary lift of IP cap on 2018-11-29 - https://phabricator.wikimedia.org/T206408 [11:09:24] Urbanecm: the second one deployed [11:09:27] ack [11:10:08] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:13:58] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) [11:14:47] (03CR) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:15:44] (03PS2) 10Zfilipin: Add new throttle rule for WMCL Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:15:52] (03CR) 10Zfilipin: Add new throttle rule for WMCL Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:15:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:17:09] (03Merged) 10jenkins-bot: Add new throttle rule for WMCL Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:17:22] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) [11:17:23] I might steal SWAT once 
it's done, there are some patches to deploy [11:17:46] (03CR) 10jenkins-bot: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 (https://phabricator.wikimedia.org/T207015) (owner: 10Urbanecm) [11:17:48] (03CR) 10jenkins-bot: Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) (owner: 10Urbanecm) [11:17:50] (03CR) 10jenkins-bot: Add new throttle rule for WMCL Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) (owner: 10Urbanecm) [11:18:54] <_joe_> Amir1: I want to convert wikidata.org today (to the new unified apache config); can I ask you for a bunch of URLs to test to verify things are ok? [11:18:56] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:467408|Add new throttle rule for WMCL Editathon (T206914)]] (duration: 00m 49s) [11:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:59] T206914: Lift IP limit - WMCL Editathon 2018-10-17 - https://phabricator.wikimedia.org/T206914 [11:19:09] _joe_: sure [11:19:16] Urbanecm: third deployed [11:19:20] ack [11:19:33] <_joe_> the deploy is planned today at 6 PM, during puppet-swat [11:19:35] zeljkof: ignore the special page fatals in mediawiki.org, that's me :D [11:19:39] <_joe_> Amir1: [11:19:43] <_joe_> naughty :P [11:19:53] Amir1: /me is pulling hair ;) [11:20:07] _joe_: sure! 
[11:20:22] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:20:38] <_joe_> we did the same conversion for test.wikidata.org and everything seems to have gone smoothly, but one never knows [11:21:46] (03CR) 10Zfilipin: Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:21:52] (03PS2) 10Zfilipin: Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:21:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:23:12] (03Merged) 10jenkins-bot: Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:24:38] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:467409|Add throttle rule for editathon at University of North Carolina at Charlotte (T207043)]] (duration: 00m 49s) [11:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:41] T207043: Lift block on creating editor accounts for wiki edit-a-thon on 2018-10-24 - https://phabricator.wikimedia.org/T207043 [11:24:55] Urbanecm: fourth one deployed ^ [11:25:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292) (owner: 10Urbanecm) [11:25:54] (03PS1) 10Ladsgroup: Re-apply "Enable reading from new backend of change_tag in s7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467669 [11:26:02] !log 
akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid [namespace: mathoid, clusters: codfw] [11:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:15] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [11:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:19] !log akosiaris@deploy1001 scap-helm mathoid finished [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:38] (03Merged) 10jenkins-bot: Enable Translate on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292) (owner: 10Urbanecm) [11:27:16] mobrovac: I am upgrading mathoid to helm chart version 0.0.12. Should increase the reliability at both the service and pod level [11:27:26] !log upgrade mathoid chart to version 0.0.12 [11:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:52] ack [11:28:02] yay [11:28:03] thnx akosiaris [11:28:15] Urbanecm: 460602 at mwdebug1002 [11:28:23] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) [11:29:26] zeljkof, can you run mwscript extensions/WikimediaMaintenance/createExtensionTables.php idwikimedia translate please? [11:29:42] Urbanecm: sure [11:29:58] Urbanecm: so, `mwscript extensions/WikimediaMaintenance/createExtensionTables.php idwikimedia translate` [11:30:02] yes [11:30:11] See https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#createExtensionTables for docs [11:30:32] I have three patches :D [11:31:38] Amir1, if you want to, you can take SWAT over after 460602 deployment [11:31:57] sure, zeljkof is it fine? [11:32:18] Amir1: anything urgent? 
there's a few more patches from Urbanecm [11:32:27] yes, but they have "lowest" priority [11:32:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:32:37] (03PS1) 10Ladsgroup: Re-enable search integration for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) [11:32:44] ah, looks like the one I'm deploying now is the only one urgent, the rest can wait [11:32:56] !log the binlog purging stopped on pc2004 (T206740) [11:32:57] exactly, I don't mind waiting for those patches [11:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:01] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [11:33:27] it's not that urgent [11:33:30] (03CR) 10jenkins-bot: Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [11:33:32] (03CR) 10jenkins-bot: Enable Translate on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292) (owner: 10Urbanecm) [11:33:36] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [11:33:39] zeljkof: tell me once you're done [11:34:14] I've removed my additional patches from the deploy window, I assume Amir1 has more urgent patches than me/lowest [11:34:15] Amir1: I'm fine either way, deploying Urbanecm's patches or leaving the rest of swat to you, it's up to the two of you :) [11:34:21] I just need to finish the current one [11:34:30] Sure. Have you run the script zeljkof ? 
[11:34:33] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [11:34:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:34:40] Urbanecm: just now [11:34:41] Thanks Urbanecm [11:34:46] yw Amir1 [11:35:03] Urbanecm: the script ran [11:35:20] ok, thank you zeljkof [11:35:26] did you test at mwdebug1002? [11:35:33] Urbanecm: ok to deploy? [11:35:37] testing [11:35:51] ok [11:35:55] * banyek lunch [11:37:49] zeljkof, please deploy it [11:37:53] Urbanecm: ok [11:38:49] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460602|Enable Translate on idwikimedia (T204292)]] (duration: 00m 49s) [11:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:52] T204292: Extension:Translate for id.wikimedia.org website - https://phabricator.wikimedia.org/T204292 [11:39:02] Urbanecm: deployed! [11:39:08] thanks! [11:39:09] Amir1: the swat is yours [11:39:18] yes, Thank you! 
[11:41:46] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) (owner: 10Ladsgroup) [11:43:28] (03PS2) 10Ladsgroup: Re-enable search integration for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) [11:43:38] (03CR) 10Ladsgroup: [C: 032] Re-enable search integration for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) (owner: 10Ladsgroup) [11:45:19] (03Merged) 10jenkins-bot: Re-enable search integration for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) (owner: 10Ladsgroup) [11:46:42] works fine, moving to prod [11:48:40] seems like I finally managed to tune the threshold correctly for the kubernetes alerts [11:48:50] (03CR) 10jenkins-bot: Re-enable search integration for ArticlePlaceholder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467671 (https://phabricator.wikimedia.org/T195751) (owner: 10Ladsgroup) [11:49:29] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467671|Re-enable search integration for ArticlePlaceholder (T195751)]] (duration: 00m 50s) [11:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:32] T195751: Enable search integration with Article Placeholder again - https://phabricator.wikimedia.org/T195751 [11:54:24] zeljkof: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/467670/ doesn't merge it because of php7.1 issue with elastic (unrelated to the patch) is it okay if I force push it? [11:54:32] https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php71-docker/43/console [11:55:25] Amir1: is there a phab task about the problem? 
[11:55:48] Yes [11:55:50] let me find it [11:56:10] Amir1: if you are reasonably sure that it's a CI problem, not the patch problem, go ahead, we have done that before, however much I don't like it ;) [11:56:15] https://phabricator.wikimedia.org/T205958 [11:56:42] Amir1: aaah 7.1 on the branch? [11:56:48] * addshore agrees it is unrelated [11:56:48] as it's merged on master and not on the branch, the only thing it needs is backport (or branch cut) [11:57:18] hashar: are you aware of T205958 [11:57:21] T205958: Wikibase\Repo\Search\Elastic\Tests\EntitySearchElasticFulltextTest::testSearchElastic fails on PHP 7.1 - https://phabricator.wikimedia.org/T205958 [11:59:03] addshore zeljkof I am here [11:59:28] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:59:52] it fixes it moving forward [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1200) [12:02:28] Jonas_WMDE: too late! 
;) [12:02:49] (03PS2) 10Ladsgroup: Re-apply "Enable reading from new backend of change_tag in s7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467669 [12:02:49] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid --set main_app.limits.memory=1g [namespace: mathoid, clusters: codfw] [12:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:00] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid --set main_app.limits.memory=1G [namespace: mathoid, clusters: codfw] [12:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:08] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.24/includes/changetags/ChangeTags.php: SWAT: [[gerrit:467670|Avoid fatals when the filter tags is empty (T194164)]] (duration: 00m 50s) [12:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:11] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [12:03:16] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467669 (owner: 10Ladsgroup) [12:03:20] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [12:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:24] !log akosiaris@deploy1001 scap-helm mathoid finished [12:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:06] zeljkof: T205958 is for php 7.1 so hardly an issue? 
[12:04:07] T205958: Wikibase\Repo\Search\Elastic\Tests\EntitySearchElasticFulltextTest::testSearchElastic fails on PHP 7.1 - https://phabricator.wikimedia.org/T205958 [12:04:28] zeljkof: if that fails, maybe we can skip those php7.1 jobs for wmf branches [12:04:29] (03Merged) 10jenkins-bot: Re-apply "Enable reading from new backend of change_tag in s7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467669 (owner: 10Ladsgroup) [12:04:43] (03CR) 10jenkins-bot: Re-apply "Enable reading from new backend of change_tag in s7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467669 (owner: 10Ladsgroup) [12:04:55] zeljkof oh noes it is the wd birthday present config change addshore amir could you make sure it will get deployed before it is too late? [12:05:19] Jonas_WMDE: I deployed it to test yesterday and it didnbt work [12:05:21] *didnt work [12:05:31] in meetings all day today but need to figure out why it isnt working there first [12:05:37] oh what? [12:05:44] it is working on the constraint test system [12:05:48] <_joe_> addshore: what doesn't work?
[12:06:29] a beta feature we tried to enable yesterday [12:07:23] <_joe_> ok so, we changed the apache configurations for test.wikidata.org yesterday afternoon, so I want to know if it can be related to URL mangling gone wrong somehow [12:07:50] <_joe_> changed as in converted to a new format, they should be 1:1 equivalent to the old ones [12:08:32] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467669|Enable reading from new backend of change_tag in s7 (T194164)]] (duration: 00m 50s) [12:08:35] <_joe_> so if you link me the change and/or the related phab task I can verify there is no bad interaction [12:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:37] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [12:08:40] SWAT is done [12:08:45] !log EU SWAT is done [12:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:54] _joe_: no it doesn't sound like any interaction would be happening there [12:09:06] <_joe_> ack, thanks [12:09:24] <_joe_> btw, it's supposed to go live on wikidata.org this afternoon in puppet-swat [12:09:33] <_joe_> the same apache config refactor, that is [12:09:43] oooh, okay, I'll be around to watch for explosions, as I also have something in puppet swat :) [12:09:50] <_joe_> I know [12:10:04] <_joe_> I shamelessly piggybacked the fact you'd be around [12:10:10] hehe [12:10:54] <_joe_> I asked Amir1 a set of standard urls to test for today [12:11:45] * addshore goes to look at the change [12:14:01] <_joe_> addshore: the link to the puppet compiler output is what you might want to look at, but it's pretty complex to verify. 
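Note on the two mathoid upgrade entries logged at 12:02:49 and 12:03:00: they differ only in the memory suffix, `1g` versus `1G`. Kubernetes resource quantities are case-sensitive and accept `G` (decimal giga) and `Gi` (binary gibi) but not a lowercase `g`. Whether that is why the command was re-run is an assumption; the log does not say. The sketch below is a minimal, hypothetical parser (`parse_memory` is not part of helm, scap-helm, or Kubernetes) that covers only the plain decimal and binary suffixes and leaves out exponent notation and sub-unit suffixes.

```python
import re

# Suffix multipliers for Kubernetes-style quantities (subset only:
# exponent notation such as 1e3 and sub-unit suffixes are left out).
DECIMAL = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
          "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
MULTIPLIERS = {**DECIMAL, **BINARY}

# A number followed by an optional alphabetic suffix.
_QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]*)$")


def parse_memory(quantity: str) -> int:
    """Hypothetical helper: convert a quantity string to bytes.

    Raises ValueError for suffixes outside the table above,
    which includes the lowercase 'g' seen in the first attempt.
    """
    match = _QUANTITY.match(quantity)
    if match is None or match.group(2) not in MULTIPLIERS:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * MULTIPLIERS[suffix])


if __name__ == "__main__":
    for value in ("1G", "1Gi", "1g"):
        try:
            print(value, "->", parse_memory(value), "bytes")
        except ValueError as err:
            print(value, "->", err)
```

The suffix lookup is case-sensitive on purpose, so `1G` and `1Gi` resolve to different byte counts and `1g` is rejected rather than silently treated as either.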
[12:14:28] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid --reset-values -f mathoid.yaml [namespace: mathoid, clusters: codfw] [12:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:35] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [12:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:37] !log akosiaris@deploy1001 scap-helm mathoid finished [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:50] ok finally now we are talking [12:16:18] _joe_, apache changes in puppet swat? :) [12:16:22] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [12:16:28] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikipedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [12:16:39] <_joe_> Krenair: well I am running puppet swat, and the change is mine :P [12:16:58] <_joe_> so it's just allocated at that time, not exactly part of puppetswat [12:17:02] (03CR) 10jenkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [12:17:18] <_joe_> oh sigh, what did I do wrong this time [12:17:24] :D [12:18:17] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [12:19:32] no, it's actually a bug in the ERB, my bad [12:20:07] zh-mo is not matched in ProxyPassRule, but listed as an alias [12:20:31] <_joe_> yes [12:20:41] <_joe_> I just answered [12:20:55] <_joe_> moritzm: also, I'm gonna do a whitespace patch to the erb before merging [12:21:01]
<_joe_> to make the diff more palatable [12:32:58] !log pool codfw mathoid [12:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:15] _joe_: I owe you btw to pool all the services in codfw after this ^ [12:33:18] haven't forgotten [12:34:06] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12946/" [puppet] - 10https://gerrit.wikimedia.org/r/467654 (owner: 10Gehel) [12:35:00] !log depool eqiad mathoid for helm chart upgrade [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:11] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: fix deprecation warnings in prometheus::node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/467654 (owner: 10Gehel) [12:36:32] (03CR) 10Gehel: "Puppet compiler looks good. This should be tested on deployment-prep, with synchronization with Michael and Mateus." [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [12:37:02] (03PS2) 10Gehel: prometheus: fix deprecation warnings in prometheus::node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/467654 [12:37:50] (03CR) 10Gehel: [C: 032] prometheus: fix deprecation warnings in prometheus::node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/467654 (owner: 10Gehel) [12:38:06] (03PS4) 10Filippo Giunchedi: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) [12:42:10] _joe_ I guess that now it is not a good moment to merge my change for mc1035 right? 
[12:43:47] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid --reset-values -f mathoid.yaml [namespace: mathoid, clusters: eqiad] [12:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:50] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [12:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:00] !log akosiaris@deploy1001 scap-helm mathoid finished [12:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:30] (03PS8) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [12:50:18] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 6 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) 05Open>03Resolved a:03Addshore All looks good to me! :) I'm going to mark this as re... [12:50:46] ^^ for anyone that happens to be looking in this channel, wikidata dispatching is now super fast! ^^ :D [12:51:48] <_joe_> nice :) [12:51:58] its super nice [12:52:06] i got pretty excited :P [12:52:07] so wait, the issue was the non compressed row format before ? 
[12:52:22] em the compact one actually [12:52:23] yup [12:52:26] wow [12:52:29] I am impressed [12:52:32] <_joe_> elukey: it's ok if you want to, I'm not doing anything on apache until 6 pm [12:52:39] !log T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename) [12:52:40] <_joe_> akosiaris: ack, repool everything [12:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:43] T165624: Request to rename LegoFan4000 to MacFan4000 on WikiTech - https://phabricator.wikimedia.org/T165624 [12:52:43] T186571: Toolforge search results does not show all maintainers - https://phabricator.wikimedia.org/T186571 [12:52:47] i wonder if that would make a difference on any other tables akosiaris :P [12:52:58] <_joe_> akosiaris: including restbase and restbase async I guess? [12:53:03] _joe_: yes [12:53:07] addshore: I wonder too [12:55:09] (03PS1) 10Gehel: prometheus: clean deprecation warnings on prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467677 [12:57:07] !log pool mathoid eqiad [12:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1300) [13:01:57] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12966/" [puppet] - 10https://gerrit.wikimedia.org/r/467677 (owner: 10Gehel) [13:03:29] (03PS3) 10Giuseppe Lavagetto: mediawiki::syslog: stop looking variables up the scope [puppet] - 10https://gerrit.wikimedia.org/r/467641 [13:07:05] (03PS1) 10Elukey: eventlogging_cleaner.py: remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/467679 (https://phabricator.wikimedia.org/T207165) [13:07:43] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 
(https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:08:08] (03PS3) 10Elukey: Apply -R 200 to mc1035's memcached instance as perf test [puppet] - 10https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786) [13:08:44] !log restart memcached on mc1035 with -R 200 (will wipe the object cache shard as consequence) - T203786 [13:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:47] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [13:09:02] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. Patch series of the year for sure!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [13:11:29] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=^apertium|citoid|cxserver|eventbus|eventstreams|graphoid|mathoid|mobileapps|ores|parsoid|pdfrender|proton|recommendation-api|restbase|restbase-async|wdqs|wdqs-internal|zotero$ [13:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:46] !log pool codfw for apertium|citoid|cxserver|eventbus|eventstreams|graphoid|mathoid|mobileapps|ores|parsoid|pdfrender|proton|recommendation-api|restbase|restbase-async|wdqs|wdqs-internal|zotero [13:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] (03CR) 10Elukey: [V: 032 C: 032] Apply -R 200 to mc1035's memcached instance as perf test [puppet] - 10https://gerrit.wikimedia.org/r/467640 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [13:13:09] (03CR) 10Mforns: [C: 031] "LGTM! 
Thanks for ninja speed" [puppet] - 10https://gerrit.wikimedia.org/r/467679 (https://phabricator.wikimedia.org/T207165) (owner: 10Elukey) [13:16:05] ok mc1035 is up and running again [13:21:10] (03PS2) 10Elukey: eventlogging_cleaner.py: remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/467679 (https://phabricator.wikimedia.org/T207165) [13:22:04] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/467679 (https://phabricator.wikimedia.org/T207165) (owner: 10Elukey) [13:22:10] (03CR) 10Alexandros Kosiaris: [C: 032] ores: puppet config for redis task tracker [puppet] - 10https://gerrit.wikimedia.org/r/467428 (https://phabricator.wikimedia.org/T152012) (owner: 10Ladsgroup) [13:22:17] (03PS2) 10Alexandros Kosiaris: ores: puppet config for redis task tracker [puppet] - 10https://gerrit.wikimedia.org/r/467428 (https://phabricator.wikimedia.org/T152012) (owner: 10Ladsgroup) [13:24:47] RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational [13:29:37] RECOVERY - Check systemd state on db1107 is OK: OK - running: The system is fully operational [13:29:48] \o/ [13:30:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall seems good, a few minor things to fix" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:33:58] !log start install process on cr2-eqdfw (non impacting before reboot) - T203261 [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] T203261: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 [13:35:03] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Central certificates service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:36:32] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::syslog: stop looking variables up the 
scope [puppet] - 10https://gerrit.wikimedia.org/r/467641 (owner: 10Giuseppe Lavagetto) [13:36:41] (03PS4) 10Giuseppe Lavagetto: mediawiki::syslog: stop looking variables up the scope [puppet] - 10https://gerrit.wikimedia.org/r/467641 [13:37:13] (03PS5) 10Mathew.onipe: scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) [13:38:10] (03CR) 10Dzahn: "the linked ticket talks about "vn" but here we are adding "vi" ?" [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [13:39:20] (03PS1) 10Anomie: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467683 (https://phabricator.wikimedia.org/T166733) [13:40:01] (03CR) 10Anomie: [C: 032] "Deploying config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467683 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [13:41:49] (03Merged) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467683 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [13:43:22] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting comment table migration stage to write-new/read-both on group 0 (T166733) (duration: 00m 50s) [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:25] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [13:43:55] !log disable external BGP sessions on cr2-eqdfw - T203261 [13:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:58] T203261: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 [13:44:26] !log reboot cr2-eqdfw for upgrade - T203261 [13:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:03] 
(03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: convert to http::site [puppet] - 10https://gerrit.wikimedia.org/r/467571 (owner: 10Giuseppe Lavagetto) [13:45:16] (03PS4) 10Giuseppe Lavagetto: mediawiki: convert to http::site [puppet] - 10https://gerrit.wikimedia.org/r/467571 [13:45:20] (03PS1) 10Gehel: tlsproxy: allow multiple default servers on different ports [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) [13:51:25] (03CR) 10jenkins-bot: Set CommentTableSchemaMigrationStage => WRITE_NEW on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467683 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [13:51:47] !log rebooting acamar for update to stretch-proposed-updates kernel [13:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:44] jouncebot: next [13:53:44] In 2 hour(s) and 6 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1600) [13:54:12] !log router back and healthy, enable external BGP sessions on cr2-eqdfw - T203261 [13:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:17] T203261: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 [13:55:17] !log depool logstash1007 to change elasticsearch data dir - T206454 [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:20] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [13:55:25] (03PS3) 10Filippo Giunchedi: logstash: move to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) [13:55:55] (03PS4) 10Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542) [13:56:02] (03CR) 10Filippo 
Giunchedi: [C: 032] logstash: move to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:58:56] (03PS5) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) [13:58:58] (03PS2) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [14:00:32] _joe_, "3 - (BEST) just add host as a parameter here, and then pass it down from profiles clearly, it's a little more code that will make all your code more explicit. Which is the reason why we have this rule." [14:00:35] a little more code? [14:00:37] lots of things are going to have to ultimately include this [14:00:39] I take it with #2 we can't include profiles outside of roles? [14:00:43] or would we not need to for this? [14:01:24] <_joe_> I suppose any declaration of certcentral::cert should be in profiles [14:01:45] <_joe_> unless we're doing things backwards everywhere else [14:01:45] no, I think it should be included under modules [14:01:49] <_joe_> which is well possible [14:01:52] <_joe_> no I mean [14:01:55] <_joe_> where it will be used [14:02:12] <_joe_> you won't declare a certcentral::cert inside another module [14:02:30] <_joe_> which way you provide the TLS certs to any function is an implementation detail [14:02:37] well you see all the places that currently have letsencrypt::cert::integrated? 
[14:03:04] <_joe_> the fact that it's done backwards now is not a good reason to persist in antipatterns, though [14:03:07] some are profiles but most are not [14:03:13] yeah I'm not going out of scope [14:03:29] <_joe_> I'm not asking you to [14:03:54] <_joe_> I'm asking whoever writes new puppet code to respect our coding guidelines [14:04:25] sure [14:04:43] but I'm not going to clean up existing noncompliant code [14:05:05] <_joe_> I get that, it should be the duty of the maintainer of that part of our code :) [14:05:16] right [14:05:27] which is why certcentral::cert right now needs to be includable inside profiles and inside modulews [14:05:32] modules* [14:06:20] !log depool in turn logstash1008 and logstash1009 to change elasticsearch data dir - T206454 [14:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:23] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [14:06:23] <_joe_> but even when it makes sense to include certcentral::cert in a module (I fail to see when, but I'll look deeper), I count 13 uses of letsencrypt::cert::integrated [14:06:23] (03PS2) 10Dzahn: nagios_common: on jessie, also install libmonitoring-plugin-perl [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) [14:06:32] (03CR) 10Dzahn: nagios_common: on jessie, also install libmonitoring-plugin-perl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:06:57] yep [14:07:03] <_joe_> of which 10 (sigh) out of profiles [14:07:05] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: on jessie, also install libmonitoring-plugin-perl [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:07:28] <_joe_> I don't think it's unreasonable to ask people to pass the parameter along, but as I said, you have two other 
options [14:07:41] <_joe_> one of which is to just add a global, if this is a global indeed [14:07:56] <_joe_> in manifests/realm.pp [14:08:47] yeah [14:08:50] it kind of is a global [14:09:05] I'm just hesitant to add to realm.pp [14:09:58] (03PS1) 10Dzahn: install_server: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467686 [14:10:00] 9:) [14:10:13] <_joe_> I thought of another possible approach [14:10:43] <_joe_> what if we proxy the requests for such certs via the puppetmaster frontend? [14:10:50] (03CR) 10Alex Monk: Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:10:59] <_joe_> it's cleaner and you just need to define your host in one place [14:11:37] _joe_, the puppetmaster frontend being responsible for passing in the client's DN? [14:12:03] <_joe_> how do you extract it now? [14:12:21] <_joe_> the client's DN, I mean [14:12:23] the certcentral API runs in uwsgi which nginx proxies to [14:12:31] the nginx has [14:12:32] ssl_client_certificate /var/lib/puppet/ssl/certs/ca.pem; [14:12:32] ssl_verify_client on; [14:12:36] uwsgi_param HTTP_X_CLIENT_DN $ssl_client_s_dn; [14:12:45] (03PS1) 10Mathew.onipe: wdqs: cleanup rspec test [puppet] - 10https://gerrit.wikimedia.org/r/467687 (https://phabricator.wikimedia.org/T204240) [14:13:01] <_joe_> ok, yes, it can for sure [14:13:17] <_joe_> we even have something like that already for the puppetmaster backends [14:13:22] indeed [14:13:43] <_joe_> so, that would seem to me as the cleanest solution, not just in terms of puppet code [14:13:53] <_joe_> but you people decide :) [14:14:05] there's a few advantages of doing it that way [14:14:19] (03PS1) 10Dzahn: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 [14:14:44] haven't come up with a disadvantage yet [14:14:48] (03PS1) 10Addshore: Add 
constraint-suggestions to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467691 (https://phabricator.wikimedia.org/T207019) [14:14:57] (03CR) 10jerkins-bot: [V: 04-1] gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [14:15:10] <_joe_> I'm going afk for a bit, I have a long few hours ahead of me already [14:15:14] ok [14:15:38] vgutierrez, any idea about that ${module_name} that snuck in there? [14:15:44] (03PS3) 10Dzahn: nagios_common: on jessie, also install libmonitoring-plugin-perl [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) [14:16:11] Krenair: yup, my fault, we can get rid of them [14:16:29] old puppet mannerisms on my layer8 [14:16:49] vgutierrez, can you think of any disadvantages of giuseppe's suggestion regarding puppetmaster frontends? [14:17:01] jouncebot: next [14:17:01] In 1 hour(s) and 42 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1600) [14:17:09] I assume that instead of restricting uwsgi to the local nginx, we'd scrap nginx and let puppetmasters talk directly to uwsgi? [14:18:19] so we move uwsgi from a UNIX socket to a TCP socket and we use ferm to restrict the access to that socket to frontend puppetmasters [14:18:29] yeah [14:18:58] we get the same nginx functionality... [14:19:33] (03PS1) 10Mathew.onipe: enable rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) [14:19:39] (03PS1) 10KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/467693 (https://phabricator.wikimedia.org/T206439) [14:19:54] we'd have some extra latency in the requests... [14:20:02] it's going to be always in the same DC, right?
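The nginx snippet quoted above hands the verified client subject DN to uwsgi as `HTTP_X_CLIENT_DN`. The log does not show how the application consumes it, so the following is only a minimal sketch under assumptions: the helper names and the `authorized_hosts` layout are hypothetical, and the DN parsing is a simplification that expects a comma-separated `KEY=value` form and ignores escaped commas.

```python
def common_name(subject_dn):
    """Return the CN from a comma-separated subject DN, or None if absent.

    Simplification (assumption): escaped commas inside attribute
    values are not handled.
    """
    for part in subject_dn.split(","):
        key, sep, value = part.strip().partition("=")
        if sep and key.upper() == "CN":
            return value
    return None


def client_may_fetch(environ, cert_name, authorized_hosts):
    """Decide whether the requesting client may fetch `cert_name`.

    `environ` is the WSGI environ; `authorized_hosts` is a hypothetical
    mapping from certificate name to the set of allowed host names.
    """
    cn = common_name(environ.get("HTTP_X_CLIENT_DN", ""))
    return cn is not None and cn in authorized_hosts.get(cert_name, set())
```

Whichever layer terminates TLS (the local nginx, or a puppetmaster frontend in the proxied variant) has to set this header itself; the application can only trust it if nothing else is able to reach the uwsgi socket.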
[14:20:22] (03CR) 10jerkins-bot: [V: 04-1] apertium: New upstream release [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/467693 (https://phabricator.wikimedia.org/T206439) (owner: 10KartikMistry) [14:20:26] true they'd be bouncing through an extra server but this isn't a particularly performance-critical thing [14:21:08] (03PS2) 10Dzahn: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 [14:21:23] yup.. besides the extra latency I don't see any disadvantage, and as you just said, we don't care that much about performance [14:21:33] I don't think there are puppetmaster frontends in all the caching DCs, but we were already requiring them to call into eqiad/codfw for this [14:21:55] they already do that for usual puppet stuff [14:22:00] yeah [14:22:00] and it's super slow [14:22:02] (03CR) 10jerkins-bot: [V: 04-1] gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 (owner: 10Dzahn) [14:22:08] I meant inter-DC requests between the puppetmaster and us [14:22:15] us being the uwsgi [14:22:42] oh you mean when the puppetmaster frontend in codfw (?) talks to the certcentral host in eqiad and vice versa? [14:22:46] the only real disadvantage is not having TLS between the puppetmaster nginx and our uwsgi [14:23:02] that didn't matter before because it was traffic running on 127.0.0.1 [14:23:07] oooh that's a no-no [14:23:15] * volans about to re-propose the active/active fileserver side [14:23:25] dammit [14:23:37] volans, can you run it through brandon before leaving CRs? [14:23:46] Krenair: maybe we can set up uwsgi to talk TLS [14:23:58] yeah [14:23:59] or [14:24:01] we keep nginx [14:24:06] you already have nginx there, right? [14:24:14] restrict that to puppetmasters in [14:24:17] then clients go [14:24:17] I have nginx in front of uwsgi for debmonitor [14:24:25] I'm here again, run what through me?
[14:24:26] client -> puppetmaster frontend -> certcentral nginx -> certcentral uwsgi [14:24:37] * volans hides [14:24:46] * vgutierrez hides too [14:25:08] certcentral nginx can verify puppetmaster frontend's cert [14:25:21] for the fetch of the cert/key files in the execution of the client-side manifest? [14:25:22] assuming it does TLS [14:25:27] otherwise let's just not [14:25:33] sure, in that case the only issue is carrying over the original client DN [14:26:03] what's the rationale for wanting to complicate this by proxying? [14:26:24] bblack, so for this to work we have to tell the client's file resources what host to fetch from [14:26:29] that involves passing some parameters around [14:26:29] (03PS3) 10Dzahn: gerrit: move letsencrypt::cert::integrated to profile [puppet] - 10https://gerrit.wikimedia.org/r/467690 [14:26:56] yes, the certcentral active hostname or whatever [14:27:01] few ways to do it, all seem kind of unsatisfactory in their own way [14:27:08] (03PS4) 10Dzahn: nagios_common: on jessie, also install libmonitoring-plugin-perl [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) [14:27:32] IMHO, it would be silly to introduce new runtime complexity in terms of proxy layers, to solve a puppet-coding-standards problem of "I don't want a small variable shared in the wrong way" [14:27:57] right [14:28:04] !log roll-restart elasticsearch on logstash100[456] to change elasticsearch data dir - T206454 [14:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:07] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [14:28:10] but then I thought about it and it may have other benefits [14:28:29] originally, instead of introducing our own nginx layer, we could just piggyback on the puppetmaster frontends [14:28:31] anyways, let me re-read all the backscroll from the past ~30 or so, give me a few and maybe I'll
have something more intelligent to say :) [14:28:35] ok [14:30:20] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12970/" [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:30:50] actually I'll type as I go, because there are several subpoints [14:32:07] ok [14:32:07] (03Abandoned) 10Dzahn: nagios_common: add stretch support to check_ssl [puppet] - 10https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:32:11] _joe_: re: the anti-pattern of including legacy LE stuff inside modules: the problems to solve there (and perhaps it's tractable) are that deployed-certificate paths are different for LE vs non-LE certs (e.g. affects nginx config templating), and also the hook into "restart/reload that's consuming the newly-deployed or renewed cert" [14:33:19] <_joe_> bblack: the way I am treating it (for other reasons) is: pass the path of the certs to the part of the module that defines the nginx proxy [14:33:21] maybe we can find ways to abstract those bits [14:33:29] <_joe_> and declare certs in the profile [14:33:39] ok [14:33:42] <_joe_> but in the specific case of that variable [14:33:54] stepping past that is the other bit, for which maybe I should give a bit of background on the redundancy model here: [14:35:01] so, the idea is certcentral1001 + certcentral2001 operate completely independently of each other (no sharing / knowledge even), procuring/renewing all configured certs in duplicate all the time.
[14:35:51] because doing this part active/passive or with some kind of synced/shared storage turns out to be less-resilient, or has issues on failover (sudden spike of LE requests -> ratelimits, and delays getting new certs out, etc) [14:36:35] BUT: we do want to declare 1 active and 1 passive from the clients' POV, globally, as in "all clients fetch from A not B, until we flip a manual switch" [14:36:59] 10Operations, 10netops: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (10ayounsi) 05Open>03Resolved Confirmed no more noisy logs. [14:37:37] because otherwise if it were random/A+A, the clients would flip-flop between two distinct copies of their certs/keys (same functional CN/SANs/etc, but different actual key/signature/hash bits), which probably isn't great for constant reload/restarts of the servers, disrupting clients constantly, etc. [14:38:47] we want that flip/flop to just happen when we decide to flip a puppet switch because a DC or host is dead. It's not a time-critical flip either. By the design of renewal periods and so-on, we can do it async by even days (so e.g. it's not a switchdc blocker) [14:39:20] but yes, that requires we have one hieradata somewhere like: "certcentral_host_for_clients: certcentral1001" [14:40:21] I don't know that it would be sensible to define that on a per-client basis, or per-consuming-class basis [14:40:30] (03PS4) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [14:40:40] and so we end up with something global [14:40:40] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) Sorry! Since you mentioned having tested this on a VM in the past I ass... 
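The active/passive model described above can be sketched roughly as follows. This is an illustration only: the key `certcentral_host_for_clients` and the two host names come from the discussion, while the functions and the dictionary standing in for hieradata are hypothetical.

```python
# Both hosts hold independently issued copies of every cert; clients
# must all read from the same one so their cert/key bits stay stable.
HOSTS = ("certcentral1001", "certcentral2001")


def fetch_host(hieradata):
    """Return the single host every client fetches from."""
    return hieradata["certcentral_host_for_clients"]


def flip(hieradata):
    """Return new hieradata with the other host active (the manual switch)."""
    current = hieradata["certcentral_host_for_clients"]
    other = HOSTS[1] if current == HOSTS[0] else HOSTS[0]
    return {**hieradata, "certcentral_host_for_clients": other}


hieradata = {"certcentral_host_for_clients": "certcentral1001"}
clients = ["client-a", "client-b", "client-c"]  # placeholder client names

# Every client resolves the same host until the switch is flipped.
before = {c: fetch_host(hieradata) for c in clients}
after = {c: fetch_host(flip(hieradata)) for c in clients}
```

Because the value lives in one place, a flip moves every client at once, which avoids the flip-flopping between two distinct key pairs that a random or active/active choice would cause.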
[14:40:42] it would just mean flipping a whole bunch of switches after grepping for the right ones to flip [14:41:18] would it be a problem if all hosts in one DC use one and the hosts in another DC use the other? [14:41:28] kinda like we do with the global cert [14:41:28] volans: yes [14:41:28] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12971/" [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [14:42:17] because people flap more often between eqiad/codfw and say eqiad/esams? [14:42:26] s/and say/than say/ [14:42:26] yes, basically [14:42:29] (03CR) 10Dzahn: "that's merged now. einsteinium has the new package" [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:42:50] but either way, it doesn't change the model. Going that route would just mean two globals instead of one, like the unified cert thing. [14:43:05] one global for non-US DCs and one global for US DCs, one of which needs switching on DC-down/switchdc events [14:43:09] (03PS3) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [14:43:42] (and in practice, the non-US global would only get used by LE certs deployed to the edge caches, but not the other $random core-only services) [14:44:04] ok [14:44:39] sorry in advance for the spam, long CR series incoming [14:44:48] ok [14:45:00] (03PS1) 10Volans: Add missing AAAA records for druid eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467701 [14:45:02] (03PS1) 10Volans: Add missing AAAA records for nameservers [dns] - 10https://gerrit.wikimedia.org/r/467702 [14:45:04] (03PS1) 10Volans: Add missing AAAA records for aqs eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467703 [14:45:06] (03PS1) 10Volans: Add missing AAAA record for matomo eqiad host [dns] -
10https://gerrit.wikimedia.org/r/467704 [14:45:08] (03PS1) 10Volans: Add missing AAAA records for analytics eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467705 [14:45:10] (03PS1) 10Volans: Fix mgmt incongruences [dns] - 10https://gerrit.wikimedia.org/r/467706 [14:45:12] (03PS1) 10Volans: Remove obsolete mgmt records [dns] - 10https://gerrit.wikimedia.org/r/467707 [14:45:14] (03PS1) 10Volans: Fix labs/cloud records [dns] - 10https://gerrit.wikimedia.org/r/467708 [14:45:16] (03PS1) 10Volans: Fix records for camera [dns] - 10https://gerrit.wikimedia.org/r/467709 [14:45:18] (03PS1) 10Volans: Add missing AAAA records to analytics hosts [dns] - 10https://gerrit.wikimedia.org/r/467710 [14:45:36] anyways, I think if our only issue driving the "proxy certcentral fileserver through puppetmaster" model is "avoid a global variable in puppet factoring", I don't really like that justification. [14:45:46] if there are some other reasons (I didn't see them above), maybe? [14:45:54] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10ayounsi) No objections for me. [14:46:05] but our answer to globals shouldn't be to remodel runtime to match puppet coding standards I don't think. [14:46:20] I agree [14:46:21] (03PS3) 10Urbanecm: Add vnwikimedia to DNS [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) [14:46:40] (03PS1) 10Volans: Fix PTR for db1042 [dns] - 10https://gerrit.wikimedia.org/r/467711 [14:46:42] (03PS1) 10Volans: Fix PTR for mr1-esams [dns] - 10https://gerrit.wikimedia.org/r/467712 [14:46:44] (03CR) 10Urbanecm: [C: 04-1] "Do not merge till objections on https://phabricator.wikimedia.org/T207052 are resolved."
[dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [14:46:49] with regards to the options in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/64/modules/certcentral/manifests/cert.pp [14:47:08] (03CR) 10Urbanecm: [C: 04-1] "I made a typo, sorry. Fixed (please see self CR-1 comment as well)." [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [14:47:17] We could pass it around everywhere, which feels wrong for something we're treating as global [14:47:37] well [14:48:03] so, that it's one global piece of data is one thing, whether it ends up being a true "global" is probably another (re: the arguments about profiles and such) [14:48:32] 10Operations, 10fundraising-tech-ops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10Dzahn) [14:48:38] only the client-side needs this data, not the server-side. So it's not like it's actually shared between classes, it's just shared between hosts using a single class [14:48:43] 10Operations, 10fundraising-tech-ops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10Dzahn) [14:48:49] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [14:49:08] so you could imagine a world like this (forgive my stupid naming that probably doesn't even match the current patch): [14:49:26] (03PS3) 10Urbanecm: Initial configuration for viwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) [14:49:32] (03PS4) 10Urbanecm: Initial configuration for viwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) [14:49:48] profile::certcentral::client - has a parameter in hieradata properly under its own
namespace, for the global "active host" setting we're discussing. [14:49:49] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467701 (owner: 10Volans) [14:50:32] role::fooservice - uses profile::certcentral::client + profile::some_apache_thing [14:51:04] !log mobrovac@deploy1001 Started deploy [proton/deploy@a657059]: Rollback to puppeteer v1.5.0 - T186748 [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [14:51:19] profile::some_apache_thing: has some params that tell it where the deployed certcentral file paths are, so it can template config + restart/reload itself, etc [14:51:53] !log mobrovac@deploy1001 Finished deploy [proton/deploy@a657059]: Rollback to puppeteer v1.5.0 - T186748 (duration: 00m 49s) [14:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:57] but this isn't the current model, which has things hackily injected deeper [14:52:06] anomie: sorry to ping you here, but could I ask you to delay MCR deployment to wikidatawiki until the s8 data issue id completely fixed? [14:52:39] e.g. tlsproxy's templating of certificate paths has conditionals like: [14:52:42] <%- @certs_nginx.each do |cert| -%> [14:52:45] ssl_certificate /etc/ssl/localcerts/<%= cert %>.chained.crt; [14:52:48] <%- if !@acme_subjects.empty? 
-%> [14:52:50] ssl_certificate /etc/acme/cert/<%= @server_name.gsub(/\W/, '_') %>.chained.crt; [14:53:06] all of that kind of thing would have to be cleaned up so cases like tlsproxy are not really aware of which cert source they're using, just handed pathnames to use [14:53:14] anomie: we are not yet 100% sure the "not in use, but will be in use soon" are consistent, and that may create issues- they should be fixed by tomorrow or at the very latest next week [14:53:20] *tables [14:53:52] jynus: That's no problem. At the moment we're not planning on deploying it anywhere else until the 29th at the earliest, since most people are at TechConf next week. [14:54:24] please ping me before doing it [14:54:30] (for wikidata only) [14:54:36] the rest no problem so far [14:54:48] (and by implication, the certcentral::cert::integrated call in tlsproxy goes away in favor of the profile->paths approach) [14:56:16] but, if we do our initial puppetization such that something like certcentral::client_cert or whatever is meant to easily replace current occurrences of letsencrypt::cert::integrated without deep refactoring on all the cases [14:56:20] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467702 (owner: 10Volans) [14:56:42] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) [14:57:02] then certcentral::client_cert isn't a profile, just a class used from other classes against standards like the old one, and the only easy way to get it the hostname consistently is through e.g. a certcentral_host global variable in common.yaml [14:57:35] apparently joe's client is flaking, so I'm guessing he's missing half of this even though he'd have Opinions!
:) [14:57:36] (03CR) 10GTirloni: [C: 032] toollabs-golang - Update to Stretch and Go 1.10 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/464736 (https://phabricator.wikimedia.org/T206276) (owner: 10GTirloni) [14:57:38] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467703 (owner: 10Volans) [14:58:02] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10greg) >>! In T206612#4665826, @LarsWirzenius wrote: > My wikitech account is LarsWirzenius, my preferred Unix username is liw. I've signed L3 on Oct... [14:58:42] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467704 (owner: 10Volans) [14:58:55] (03CR) 10Dzahn: "you should ask Rob about the cameras. i remember running across these before and i'm not sure if they are still actually mounted in the re" [dns] - 10https://gerrit.wikimedia.org/r/467709 (owner: 10Volans) [14:59:16] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467705 (owner: 10Volans) [15:00:28] so I think the real question is one of ordering, there's 3 paths towards eventual consistency: [15:01:02] <_joe_> sorry bblack I had my bouncer coming back and doing a join/part battle with irccloud [15:01:03] (03CR) 10Ottomata: [C: 031] Add missing AAAA records for druid eqiad hosts [dns] - 10https://gerrit.wikimedia.org/r/467701 (owner: 10Volans) [15:01:03] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." 
[dns] - 10https://gerrit.wikimedia.org/r/467706 (owner: 10Volans) [15:01:20] <_joe_> so I lost the last context, but I'll read the logs [15:01:32] ok [15:01:37] so I think the real question is one of ordering, there's 3 paths towards eventual consistency: [15:02:34] 1) We deploy certcentral client puppetization roughly like it is now, as an easy drop-in for all current usage of letsencrypt::cert::integrated, and then later we refactor towards an cert-client-profile to remove the global and the deep module->module refs in favor of profiles and explicitly-passed cert path info, etc (there's probably a couple other bits, but tractable) [15:02:35] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467707 (owner: 10Volans) [15:03:29] 2) We refactor certcentral client puppetization to do it "right" from the start, and then have to (possibly deeply) refactor each current le::cert::integrated client class as we convert them to consuming certs from the new code. [15:04:22] 3) We refactor the current le::cert::integrated and all its consumers to the proper profile-based way first, and do the initial certcentral the right way, and the transition process is again easy as well. [15:05:08] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467708 (owner: 10Volans) [15:05:32] 10Operations, 10Wikimedia-Logstash: logstash HTTP Basic Auth prompt says "WMF Labs" - https://phabricator.wikimedia.org/T207178 (10herron) [15:05:47] (2) is unappealing because it makes the transition harder than it has to be. 
While we're increasingly rolling out certs and trying to track down real certcentral bugs, our schedule will also be hampered by all the refactoring of unrelated modules inlined into the process [15:06:04] 10Operations, 10Wikimedia-Logstash: logstash HTTP Basic Auth prompt says "WMF Labs" - https://phabricator.wikimedia.org/T207178 (10GTirloni) For reference, Labs -> WMCS renaming: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Rebranding_Cloud_Services_products [15:06:15] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467709 (owner: 10Volans) [15:06:17] (03CR) 10Dzahn: "looks good but what it actually fixes is db2041/2042, not db1042" [dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [15:06:24] <_joe_> bblack: 4) we do as I suggested earlier to avoid the most hideous thing (the deep hiera call) - we proxy calls to certcentral from puppet via the puppetmaster frontend [15:06:37] <_joe_> and we can refactor things with ease when we prefer [15:06:38] (3) is unappealing because it means refactoring a bunch of code we're about to throw away, and stalling some progress a bit until we're done [15:06:55] (1) is unappealing because it leaves things unclean until someone gets around to cleaning them, which may be never [15:06:57] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467710 (owner: 10Volans) [15:08:04] well [15:08:20] (4) is unappealing on the grounds that it just seems fundamentally wrong to introduce new real-world/runtime proxying of data to solve a puppet code factoring/standards issue. [15:08:25] re (1) surely that's the default position [15:08:26] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." 
[dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [15:08:28] so I can see unappealing reasons for all of this [15:08:48] <_joe_> bblack: well, this is serving files via puppet [15:08:53] Krenair: only in the sense that it's the approximate current state of the code being reviewed. I don't think that really makes it a default. [15:08:59] <_joe_> so going through the puppetmaster doesn't exactly seem asinine [15:09:20] <_joe_> also would allow us to limit the servers that can fetch private keys from the certcentral application [15:09:50] we already limit the servers that can fetch private keys [15:09:54] _joe_: yeah but it was never meant to be master-based really, and we are planning to limit the servers (that's in config too, for the certcentral boxes to verify clients against known clients of a given cert) [15:10:12] (03CR) 10Volans: "I will not merge this change, I'll leave it to the service owners, see below." [dns] - 10https://gerrit.wikimedia.org/r/467712 (owner: 10Volans) [15:10:15] another way to think about the puppet-fileserver integration is this: [15:10:58] we could/should have just served the certs over plain client-authenticated HTTPS, rather than co-opting puppet fileserver protocol (and I think we do, for the general use-case outside of here). [15:11:27] but we reasoned that integrating that in a puppet world is hard. 
we then have to puppetize an HTTPS fetcher script with client auth, and then trigger from that client-side-runtime code whatever service reloads, etc [15:11:34] <_joe_> the appeal I see is the puppet agent connecting to a single server [15:12:08] vs subbing in puppet fileserver protocol so that we can get puppet-level notify hooks and the client cert automagically, and not deploy that extra per-service integration script, etc [15:12:14] <_joe_> but I am ok with slapping the url of the host in realm.pp via a hiera call there, too (that would be solution 1) [15:12:39] (03CR) 10Mobrovac: scap::target: added additional_services_names param (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [15:12:40] <_joe_> I agree the approach you chose server-side it's better [15:13:26] (03PS2) 10Dzahn: Fix PTR for db2041 [dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [15:14:06] mutante: the previous commit message was correct [15:14:12] for solution 1, we could take any of whichever global-variable options make the most sense. it could be e.g. "certcentral_fileserver_host" in common.yaml, or $certcentral_fileserver_host = "foo" in realm.pp, or yes I guess realm.pp reading from common.yaml or whatever. [15:14:29] I don't know which is the least-painful [15:14:44] volans: it only touches names in codfw, how can it fix one in eqiad? 
[15:15:03] sorry I meant 42 [15:15:04] not 41 [15:15:12] <_joe_> I just want to avoid new hiera calls nested deep into our code [15:15:16] rotlf, I didn't see I put 1xxx [15:15:18] my bad [15:15:34] that's the part i meant, yea [15:15:35] <_joe_> having it in realm.pp is acceptable, even if a bit ugly [15:15:59] I see [15:16:09] <_joe_> $certcentral_host = hiera('...') [15:16:11] 10Operations, 10SRE-Access-Requests: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10Imarlier) Yes, this is approved. [15:16:21] (03PS5) 10Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542) [15:16:22] _joe_: how does the proxying solution work exactly? we put some magic in puppet's apache config to proxy certain paths out? [15:16:26] <_joe_> and then refer to it as $::certcentral_host everywhere [15:16:32] <_joe_> bblack: yes [15:17:02] is that mode likely to survive future puppet major versions? [15:17:17] well I guess that's a good question for our fileserver integration directly, too [15:17:18] <_joe_> yes [15:17:54] <_joe_> or at least I hope, or we're in *big* trouble, if http proxying puppet requests doesn't work anymore [15:18:36] <_joe_> bigger than this for sure [15:19:34] I'm just imagining something where they change the protocol and subsume all our proxying with some other loadbalance solution, etc [15:19:49] but in that kind of wild case, they're as likely to break fileserver protocol compatibility anyways. [15:20:05] (03PS3) 10Volans: Fix PTR for db2042 [dns] - 10https://gerrit.wikimedia.org/r/467711 [15:22:29] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10Cmjohnson) A case has been opened up with HPE Support Your case was successfully submitted. Please note your Case ID: 5333327393 for future reference. 
[15:22:58] (03CR) 10Papaul: [C: 032] Fix PTR for db2042 [dns] - 10https://gerrit.wikimedia.org/r/467711 (owner: 10Volans) [15:23:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Andrew) p:05Triage>03High [15:24:11] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Andrew) We might need to check if other things (e.g scratch) have this same issue. [15:27:32] (03CR) 10Anomie: [C: 04-1] "This doesn't work right due to T207186." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) (owner: 10MGChecker) [15:27:38] Krenair: _joe_: so how about we go with option 1, which is largely like the current CR structurally (except move the global to realm.pp), and tack onto our Q2 goal (which is largely: migrate all the letsencrypt::cert::integrated clients) to refactor the client-side integration at the end, so it doesn't get lost for years? [15:27:59] (03CR) 10Ayounsi: [C: 031] Fix PTR for mr1-esams [dns] - 10https://gerrit.wikimedia.org/r/467712 (owner: 10Volans) [15:28:27] ok [15:28:33] or option 4 I guess is next-best, which is do the abstraction through puppet proxying which also leaves most of the CR intact, and same basic plan [15:29:07] I could go either way. 
I just don't want to refactor this PS65 or whatever it's at and then refactor all the user-classes as we go, or refactor the soon-to-be-dead code [15:29:23] (03PS6) 10Mathew.onipe: scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) [15:29:32] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10CCicalese_WMF) [15:29:47] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.109:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.109, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fef22b27bd0: Failed to establish a new connection: [Errno 111] Co [15:30:08] (03CR) 10Mathew.onipe: scap::target: added additional_services_names param (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [15:30:18] RECOVERY - Host cloudvirt1019 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [15:31:03] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) Still waiting on a response [15:32:06] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 86, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.1 [15:32:07] hards: 232, initializing_shards: 2, 
number_of_data_nodes: 3, delayed_unassigned_shards: 0 [15:32:34] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) a:03ArielGlenn @ArielGlenn Can you help get this disk back into rotation. Shows as unconfigured good [15:32:45] the elasticsearch health check is me btw [15:34:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Turn off wikidata dispatch verbose mode [puppet] - 10https://gerrit.wikimedia.org/r/467282 (owner: 10Addshore) [15:36:36] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:37:20] _joe_: from my pov, I guess I'd still choose 1, because it avoids having to also patch up the proxying part, and either way we'd refactor to a profile later, unless you would strongly rather we proxy first? [15:38:32] (well, option 1 + tack on a goals thing to do the refactoring to a profile at the end, so we don't lose track of that entirely) [15:38:42] <_joe_> bblack: that's ok, just declare the global in realm.pp then. It's a bit ugly, but we can swap that out quickly in case of need [15:38:46] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:04] _joe_: ok, thanks! 
[15:39:08] (03PS65) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [15:39:51] (03CR) 10Alex Monk: Central certificates service (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:39:53] (03CR) 10jerkins-bot: [V: 04-1] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:40:40] the double quotes :) [15:40:52] on the template paths [15:41:33] yeah I know [15:41:42] (03PS66) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [15:43:47] (03PS1) 10Jgreen: add metrics to nsca_frack.cfg.erb and alphabetize them [puppet] - 10https://gerrit.wikimedia.org/r/467721 [15:43:57] (03PS2) 10Gehel: tlsproxy: allow multiple default servers on different ports [puppet] - 10https://gerrit.wikimedia.org/r/467684 (https://phabricator.wikimedia.org/T198352) [15:44:36] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:33] (03CR) 10Jgreen: [C: 032] add metrics to nsca_frack.cfg.erb and alphabetize them [puppet] - 10https://gerrit.wikimedia.org/r/467721 (owner: 10Jgreen) [15:45:47] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [15:46:36] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:46:56] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [15:47:46] RECOVERY - Request latencies on neon is 
OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:50:40] (03CR) 10BBlack: [C: 031] "LGTM for now, modulo tiny whitespace nits!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 10Imarlier) [15:50:41] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8 - OK: 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207189 [15:50:45] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T207189 (10ops-monitoring-bot) [15:51:58] (03PS1) 10Banyek: mariadb: enable notifications for db2096 [puppet] - 10https://gerrit.wikimedia.org/r/467722 (https://phabricator.wikimedia.org/T206593) [15:52:44] 10Operations, 10Wikimedia-Logstash: logstash HTTP Basic Auth prompt says "WMF Labs" - https://phabricator.wikimedia.org/T207178 (10Vgutierrez) p:05Triage>03Normal [15:52:44] (03PS6) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) [15:53:04] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:53] (03CR) 10BBlack: [C: 032] interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) (owner: 10BBlack) [15:55:34] (03CR) 10Mobrovac: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [15:55:38] (03PS6) 10Filippo Giunchedi: New Kafka cluster logging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [15:59:40] (03CR) 10BBlack: [C: 04-1] "Also, should for now define numa_networking = "on" in the per-host hieradata/hosts/wdqsNNNN.yaml, 
before or during turning this on here. " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) (owner: 10Gehel) [15:59:48] <_joe_> can I ask for a puppet merge moratorium as soon as PuppetSWAT begins? [16:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1600). [16:00:04] addshore and _joe_: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:22] <_joe_> ALL: please don't merge new puppet patches until I say so [16:00:28] \o [16:01:03] (03PS2) 10Giuseppe Lavagetto: Turn off wikidata dispatch verbose mode [puppet] - 10https://gerrit.wikimedia.org/r/467282 (owner: 10Addshore) [16:01:19] <_joe_> addshore: let's merge as soon as jenkins complies [16:01:26] ack [16:01:35] (03CR) 10Giuseppe Lavagetto: [C: 032] Turn off wikidata dispatch verbose mode [puppet] - 10https://gerrit.wikimedia.org/r/467282 (owner: 10Addshore) [16:02:25] PROBLEM - Device not healthy -SMART- on cloudvirt1019 is CRITICAL: cluster=misc device={cciss,6,cciss,7,cciss,8,cciss,9} instance=cloudvirt1019:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [16:02:45] (03PS1) 10BryanDavis: apache basic auth: Use term "developer account" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) [16:03:26] (03PS2) 10BryanDavis: apache basic auth: Use term "developer account" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) [16:03:40] <_joe_> addshore: your change is merged and applied [16:03:43] amazing [16:03:46] I will watch the logs [16:04:26] _joe_: 
https://gerrit.wikimedia.org/r/467723 is a pile of those "WMF Labs" basic auth prompts [16:04:31] <_joe_> some new dispatchers are running without verbose [16:05:03] _joe_: :) [16:05:04] <_joe_> bd808: ack, I am about to win another of those t-shirts though [16:05:22] :) good luck [16:05:33] bd808, zomg more apache config :) [16:05:45] (03PS7) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) [16:06:21] <_joe_> addshore: I'm going to merge ^^, and then apply puppet across the fleet once I tested the change on one host [16:06:28] ack [16:06:40] <_joe_> https://etherpad.wikimedia.org/p/joes-random-notes has a list of urls we'll be testing [16:06:43] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:06:44] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [16:06:50] elukey... 
[16:06:53] (03PS6) 10Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542) [16:06:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:06:55] argh sorry [16:06:59] just seen it [16:07:10] only +2ed [16:07:14] will stop [16:07:26] <_joe_> lemme see fatalmonitor [16:08:05] (03PS2) 10Imarlier: Add sitemaps rewrite for additional domains [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) [16:08:44] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:08:54] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:08:58] (03CR) 10Imarlier: Add sitemaps rewrite for additional domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 10Imarlier) [16:09:38] <_joe_> ok, I'd proceed [16:09:44] +1 [16:09:50] <_joe_> akosiaris: can you +1 the change if you think it's ok? 
[16:10:02] <_joe_> even here is ok :) [16:10:10] <_joe_> I'm not sure if I fully convinced you [16:10:36] I 'll +1 it hoping I won't regret it, but it's just a step towards something better [16:10:48] cause there is some dups here and there [16:11:40] <_joe_> akosiaris: sure, there are [16:12:03] <_joe_> but the three I introduce will make things faster (less rewriting of urls) [16:12:23] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.871 second response time [16:12:58] <_joe_> ok last chance to stop me :P [16:13:15] (03CR) 10Alexandros Kosiaris: [C: 031] mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [16:13:23] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [16:14:04] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:15:13] (03CR) 10MGChecker: "which is resolved now…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) (owner: 10MGChecker) [16:15:13] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:15:58] <_joe_> !log disabled puppet on all appservers, merging wikidata apache change, re-enabling puppet on mwdebug1001 for testing [16:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:31] <_joe_> so, math redirection changes, uhm [16:18:59] <_joe_> oh simply wasn't there, ofc [16:19:11] <_joe_> ok, no harm done AFAICT [16:20:23] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:21:00] <_joe_> running puppet across the apaches [16:22:34] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:24:34] <_joe_> addshore: the change is applied everywhere, lemme know if you see issues/if they're reported [16:24:43] will do [16:26:53] (03CR) 10Lars Wirzenius: "I'd prefer to say which wiki's credentials, but I'm OK with the change as is." [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [16:28:48] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Bstorm) Scratch is universally exported. Maps has specific IPs listed. Dumps appears to be controlled via a hiera value. What's the correct... [16:33:18] !log depool restbase-async from eqiad in order to test traffic going to parsoid codfw [16:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:12] bstorm_, well the range in the ticket was the old labs instances range right? [16:34:20] <_joe_> ok puppet swat is done :) [16:34:24] Yes. 
[16:34:29] <_joe_> elukey: merge your change, sorry [16:34:32] <_joe_> and everyone else [16:34:37] I just need the correct range for the current ones, Krenair [16:34:39] <_joe_> but this was a delicate change [16:34:40] Bsadowski1, so you just need to add the new 172.16.0.0/12 ? [16:34:41] The newer ones [16:34:48] sorry [16:34:49] bstorm_, ^ [16:35:15] (03CR) 10BryanDavis: "> I'd prefer to say which wiki's credentials, but I'm OK with the" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [16:35:23] <_joe_> Amir1, addshore if you notice or are reported some issue, ask someone to phone me if I'm needed. At any hour [16:35:27] Krenair: yeah, if that's the right one :) Sure [16:35:29] hmm [16:35:32] wait [16:35:49] ack [16:35:56] looks like we're actually using 172.16.0.0/21 Bsadowski1 [16:35:59] ack [16:36:00] bstorm_, ^ dammit did it again [16:37:12] from a quick look at the puppet repo [16:38:38] 10Operations, 10Parsoid, 10Datacenter-Switchover-2018: Parsoid no longer active-active - https://phabricator.wikimedia.org/T207091 (10akosiaris) 05Open>03Resolved a:03akosiaris This is now fixed per https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-15m&to=now&cluster=pars... [16:40:05] yep, 172.16.0.0/21 is for the current build-out. We may eventually add things to 172.16.128.0/21 as well [16:40:38] (03PS7) 10Elukey: profile::analytics::refinery::job::camus: use camus to backup el-client-side [puppet] - 10https://gerrit.wikimedia.org/r/467646 (https://phabricator.wikimedia.org/T206542) [16:40:39] you mean 172.16.128.0/22 ? [16:43:00] Krenair: no, /21. according to puppet. 
[16:43:00] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.416 second response time [16:43:15] https://www.irccloud.com/pastebin/CiQX5dhm/ [16:43:27] https://www.irccloud.com/pastebin/QPpQrteX/ [16:43:39] uh, right, yes [16:43:41] I guess that second bit is moot anyway, no current plans to add things to codfw [16:45:20] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10RobH) p:05Triage>03Normal [16:46:11] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:15] yes, they're both /21 [16:46:39] 172.16/12 is the broader definition of that whole private address space in IANA, not what's allocated to our private use in eqiad/codfw internally [16:46:53] (03CR) 10Volans: [C: 04-1] "I think we can simplify it, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:48:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.395 second response time [16:49:10] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:49:10] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:51:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Andrew) we discussed this on IRC but for the record, the additional range we need is 172.16.0.0/21 [16:53:31] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:53:31] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:56:20] (03PS2) 10Gehel: wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) [16:57:08] (03CR) 10Gehel: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) (owner: 10Gehel) [16:57:33] bblack: ^ ready for your last review (hopefully) [16:59:15] oh hmmm [16:59:45] gehel: $interface_primary isn't a global, it's a fact: $facts['interface_primary'] [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1700). [17:00:10] bblack: looking [17:00:24] I think you can just replace $::interface_primary with that [17:00:42] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [17:00:42] (03CR) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:00:50] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) [17:00:59] no parsoid deploy today [17:01:13] bblack: facts are not available as globals anymore? 
I thought that was the case [17:01:39] <_joe_> !log restarted pdfrender on scb1004 [17:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:34] gehel: I don't know if they are, but git grep says nobody else uses interface_primary as a global, only as $facts[] [17:02:47] (03PS3) 10Gehel: wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) [17:02:58] bblack: corrected [17:04:29] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) p:05Triage>03Normal [17:06:10] pretty sure $::fqdn is available and that's from facter, right gehel ? [17:06:49] (03PS1) 10Bstorm: labstore: add new openstack region to dumps labstores [puppet] - 10https://gerrit.wikimedia.org/r/467732 (https://phabricator.wikimedia.org/T207184) [17:07:52] Krenair: looks like the documented way to access facts is via $facts[], I think the facts as globals is a leftover from a long time ago.
[17:08:02] Not sure when $facts[] was introduced [17:09:04] (03CR) 10Bstorm: [C: 032] labstore: add new openstack region to dumps labstores [puppet] - 10https://gerrit.wikimedia.org/r/467732 (https://phabricator.wikimedia.org/T207184) (owner: 10Bstorm) [17:09:10] os_version appears to work by looking up lsbdist* stuff using lookupvar [17:09:13] (03PS2) 10Bstorm: labstore: add new openstack region to dumps labstores [puppet] - 10https://gerrit.wikimedia.org/r/467732 (https://phabricator.wikimedia.org/T207184) [17:09:35] (03PS3) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) [17:09:41] (03CR) 10Andrew Bogott: [C: 031] labstore: add new openstack region to dumps labstores [puppet] - 10https://gerrit.wikimedia.org/r/467732 (https://phabricator.wikimedia.org/T207184) (owner: 10Bstorm) [17:10:12] (03PS4) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) [17:10:57] (03PS5) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) [17:11:52] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) a:05Cmjohnson>03Ottomata So, we are not sure about what vlan these will be going into. This could affect what row they go into. @Ottomata: Can y... [17:14:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10Patch-For-Review: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Andrew) looks fixed to me! 
[17:15:09] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) @Smalyshev, dumps should be available on that host now. [17:15:57] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10Patch-For-Review: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Bstorm) Re-exported after some manual puppet runs. If it's good on the clients, I'll close it. [17:16:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Bstorm) [17:16:54] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10Patch-For-Review: dumps mounts don't show up on eqiad1-r VMs - https://phabricator.wikimedia.org/T207184 (10Bstorm) 05Open>03Resolved [17:18:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.670 second response time [17:19:24] (03PS1) 10Effie Mouzeli: WIP: Added new role::redis::misc for general purposes redis servers [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [17:21:00] (03PS1) 10WMDE-leszek: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) [17:21:02] (03PS1) 10WMDE-leszek: Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) [17:21:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:45] (03CR) 10jerkins-bot: [V: 04-1] Wikidata: add setting for setting the enabled entity data forms 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [17:21:47] (03CR) 10WMDE-leszek: [C: 04-1] "only to be merged once If559090ca1982e17c19dd5a4bb99620eaa3af9ed is deployed on test wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [17:21:52] (03CR) 10jerkins-bot: [V: 04-1] Wikidata: enable JSON-LD data format on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [17:21:59] (03CR) 10WMDE-leszek: "commented on the wrong patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [17:22:15] (03CR) 10WMDE-leszek: [C: 04-1] "only to be merged once If559090ca1982e17c19dd5a4bb99620eaa3af9ed is deployed on test wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467736 (https://phabricator.wikimedia.org/T207196) (owner: 10WMDE-leszek) [17:24:47] (03PS2) 10WMDE-leszek: Wikidata: add setting for setting the enabled entity data forms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467735 (https://phabricator.wikimedia.org/T207196) [17:26:13] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM! 
I'm not sure offhand if this will cause apache restart/reload though, if it does we'll need to apply some care post merge just in ca" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [17:27:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.18 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:30:32] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 75.52 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:32:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 46.84 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:34:46] (03CR) 10Smalyshev: [C: 031] Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: 10Urbanecm) [17:37:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:40:11] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:42:22] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:43:42] (03PS1) 10Gehel: wdqs: labs should be deployed manually [puppet] - 10https://gerrit.wikimedia.org/r/467738 [17:46:05] (03CR) 10Smalyshev: [C: 031] wdqs: labs should be deployed manually [puppet] - 10https://gerrit.wikimedia.org/r/467738 (owner: 10Gehel) [17:47:42] (03PS2) 10Gehel: wdqs: labs should be deployed manually [puppet] - 10https://gerrit.wikimedia.org/r/467738 [17:50:41] (03CR) 10Gehel: [C: 032] wdqs: labs should be deployed manually [puppet] - 10https://gerrit.wikimedia.org/r/467738 (owner: 10Gehel) [17:52:36] (03PS1) 10CRusnov: admin: adding user crusnov and key material [puppet] - 10https://gerrit.wikimedia.org/r/467740 (https://phabricator.wikimedia.org/T207009) [17:52:38] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/467740 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [17:54:26] (03PS1) 10Gehel: wdqs: missing `nodes` configuration for labs [puppet] - 10https://gerrit.wikimedia.org/r/467741 [17:55:06] cool, how jenkins says welcome [17:55:24] (03CR) 10Gehel: [C: 032] wdqs: missing `nodes` configuration for labs [puppet] - 10https://gerrit.wikimedia.org/r/467741 (owner: 10Gehel) [17:55:39] yeah :) [17:56:38] (03PS1) 10Alexandros Kosiaris: Make restbase active/active [puppet] - 10https://gerrit.wikimedia.org/r/467742 [18:00:56] (03PS2) 10CRusnov: admin: adding user crusnov and key material [puppet] - 10https://gerrit.wikimedia.org/r/467740 (https://phabricator.wikimedia.org/T207009) [18:02:01] (03CR) 10Volans: [C: 032] admin: adding user crusnov and key material [puppet] - 10https://gerrit.wikimedia.org/r/467740 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [18:07:12] (03PS1) 10Gehel: wdqs: align labs configuration with what we have on production [puppet] - 
10https://gerrit.wikimedia.org/r/467743 [18:07:22] (03CR) 10Ppchelko: "What about the background jobs which we process in codfw (restbase-async.discovery.wmnet)? Should we make them active-active too? The sepa" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [18:07:54] (03CR) 10Smalyshev: [C: 031] wdqs: align labs configuration with what we have on production [puppet] - 10https://gerrit.wikimedia.org/r/467743 (owner: 10Gehel) [18:08:07] (03CR) 10Gehel: [C: 032] wdqs: align labs configuration with what we have on production [puppet] - 10https://gerrit.wikimedia.org/r/467743 (owner: 10Gehel) [18:08:24] (03PS2) 10Gehel: wdqs: align labs configuration with what we have on production [puppet] - 10https://gerrit.wikimedia.org/r/467743 [18:10:02] (03CR) 10Dzahn: [C: 032] "tested directly on einsteinium with different check commands using check_ssl.. all worked fine" [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:10:12] (03PS4) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [18:11:43] (03PS1) 10Gehel: wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467744 [18:12:04] (03PS1) 10Volans: Revert "admins: add Cas Rusnov to admins as ldap_only" [puppet] - 10https://gerrit.wikimedia.org/r/467745 [18:12:32] mutante: can you double check please? 
^^^ [18:13:42] (03PS1) 10Elukey: camus: fix eventlogging-client-side configuration [puppet] - 10https://gerrit.wikimedia.org/r/467746 (https://phabricator.wikimedia.org/T206542) [18:13:51] PROBLEM - Device not healthy -SMART- on db2051 is CRITICAL: cluster=mysql device=cciss,2 instance=db2051:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2051&var-datasource=codfw%2520prometheus%252Fops [18:13:55] (03CR) 10Mathew.onipe: [C: 031] wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467744 (owner: 10Gehel) [18:14:03] (03PS2) 10Elukey: camus: fix eventlogging-client-side configuration [puppet] - 10https://gerrit.wikimedia.org/r/467746 (https://phabricator.wikimedia.org/T206542) [18:14:05] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10crusnov) [18:14:06] (03CR) 10Gehel: [C: 032] wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467744 (owner: 10Gehel) [18:15:12] (03PS3) 10Elukey: camus: fix eventlogging-client-side configuration [puppet] - 10https://gerrit.wikimedia.org/r/467746 (https://phabricator.wikimedia.org/T206542) [18:15:18] volans: i dont see him in the admin module in that change? maybe after rebase? [18:15:22] (03CR) 10Elukey: [V: 032 C: 032] camus: fix eventlogging-client-side configuration [puppet] - 10https://gerrit.wikimedia.org/r/467746 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [18:15:29] mutante: it was merged already [18:15:36] sure to be rebased [18:15:50] (03PS2) 10Dzahn: Revert "admins: add Cas Rusnov to admins as ldap_only" [puppet] - 10https://gerrit.wikimedia.org/r/467745 (owner: 10Volans) [18:15:51] gehel: ready to merge? 
[18:16:03] elukey: done [18:16:18] super [18:16:48] ERROR: puppet-merge on puppetmaster2002.codfw.wmnet failed [18:16:51] lovely [18:17:14] (03CR) 10Dzahn: [C: 031] Revert "admins: add Cas Rusnov to admins as ldap_only" [puppet] - 10https://gerrit.wikimedia.org/r/467745 (owner: 10Volans) [18:17:27] volans: yep, looks good now. and the UID matches LDAP user [18:17:36] didnt check the key or anything [18:17:49] if you just meant the revert part.. ACK , +1 [18:17:51] great, thanks! yeah I took care of that [18:17:53] yep [18:17:55] thanks [18:18:21] anybody going to merge something? I believe that my error for pm2002 will go away during the next merge [18:18:29] elukey: yes me [18:18:34] should I go ahead? [18:18:34] (03PS1) 10Gehel: wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467748 [18:18:47] elukey: volans me too if you need :) [18:18:55] it said [18:18:55] error: cannot lock ref 'refs/remotes/origin/production': is at 6082be9e9cb86651215af67954ee0f3ad85fc652 but expected eb489866ef1ee52a2fbac4888d02eaa491aabdae [18:19:07] only for 2002 [18:19:15] mmmh [18:19:25] I might have run puppet-merge too soon after gehel [18:19:38] (03PS5) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [18:19:41] race conditions! [18:19:50] (03CR) 10Gehel: [C: 032] wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467748 (owner: 10Gehel) [18:19:59] (03PS2) 10Gehel: wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467748 [18:20:04] (03CR) 10Gehel: [V: 032 C: 032] wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467748 (owner: 10Gehel) [18:20:29] elukey: I'm merging, let's see if it works [18:20:36] ack, [18:20:40] * volans next in line [18:21:02] WARNING: Revision range includes commits from multiple committers! 
[18:21:03] lol, i was already in it [18:21:28] on puppetmaster2002 [18:21:42] but looks like the merge was done anyway [18:21:46] so probably good [18:21:55] you have to actually type "multiple" to make it happen [18:22:24] nah, I was merging a single change, but it looks like 2 changes were merged on 2002 [18:22:37] probably the recovery from the situation elukey described [18:22:41] yep [18:22:57] it didn't merge my commit on 2002 [18:23:01] so it did this time [18:23:04] seems good [18:28:01] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) p:05Triage>03Normal [18:29:03] (03PS6) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [18:29:09] should we all use the same master for merging to avoid that kind of thing? [18:29:14] i use 1001 [18:31:21] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.wb_items_per_site: Cant find record in wb_items_per_site, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003068, end_log_pos 176514643 [18:32:59] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) [18:33:17] mutante: I use 1001 too, it failed running puppet-merge [18:33:26] (on 2002 I mean) [18:33:29] elukey: gotcha, ok [18:33:54] (03CR) 10Eevans: "Do we know what this will do to traffic? 
How many requests will go to codfw (presumably any that ago there come off eqiad)?" [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [18:34:26] (03PS5) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [18:34:40] replaced Nagios::Plugin with Monitoring::Plugin on prod Icinga.. for check_ssl [18:34:57] tested manually first of course.. it worked.. still wanted to mention it [18:36:10] in case there is an unexpected check_ssl related issue.. but there really should not [18:36:37] this is to make it work on both jessie and stretch [18:38:16] (03CR) 10Bstorm: "Fixed a missing comma in the yaml" [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [18:38:33] (03PS6) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [18:39:51] (03CR) 10Bstorm: [C: 032] "Local testing seems good." 
[puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [18:42:25] !log ppchelko@deploy1001 Started deploy [proton/deploy@a657059]: Try restarting for metrics [18:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:45] !log ppchelko@deploy1001 Finished deploy [proton/deploy@a657059]: Try restarting for metrics (duration: 00m 20s) [18:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:32] (03PS1) 10Pmiazga: Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) [18:43:36] !log ppchelko@deploy1001 Started restart [proton/deploy@a657059]: Try restarting again for metrics [18:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:38] (03PS2) 10Dzahn: icinga: replace Nagios::Plugin with Monitoring::Plugin in multiple Perl scripts [puppet] - 10https://gerrit.wikimedia.org/r/467015 (https://phabricator.wikimedia.org/T202782) [18:43:54] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10herron) [18:44:44] !log ppchelko@deploy1001 Started restart [proton/deploy@a657059]: Try restarting again for metrics [18:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:07] (03PS2) 10Pmiazga: Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) [18:56:38] (03PS8) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [18:57:45] (03PS1) 10Dzahn: nagios_common: convert check_bgp to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467763 (https://phabricator.wikimedia.org/T202782) [18:59:01] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: 
rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) They need to be reachable by the Analytics VLAN, so I would normally propose that one. Since this is a special case, maybe it makes more sense to... [19:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T1900). [19:01:44] (03CR) 10Jdlrobson: [C: 031] Enable client side error counting on Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467760 (https://phabricator.wikimedia.org/T206702) (owner: 10Pmiazga) [19:07:44] (03CR) 10Cwhite: [C: 031] "Looks good to me, as long as the database dashboards are good." [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:09:43] (03PS1) 10Gehel: wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467764 [19:10:01] (03PS2) 10Gehel: wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467764 [19:11:02] (03CR) 10Gehel: [C: 032] wdqs: 'manual' is a new valid deployment mode [puppet] - 10https://gerrit.wikimedia.org/r/467764 (owner: 10Gehel) [19:11:17] (03PS2) 10Dzahn: nagios_common: convert check_bgp to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467763 (https://phabricator.wikimedia.org/T202782) [19:11:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster - https://phabricator.wikimedia.org/T207205 (10Ottomata) [19:11:36] (03CR) 10Dzahn: [C: 032] "tested" [puppet] - 10https://gerrit.wikimedia.org/r/467763 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:13:05] (03PS9) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [19:13:54] (03PS10) 10Paladox: Planet: Redesign UI 
[puppet] - 10https://gerrit.wikimedia.org/r/467100 [19:15:04] herron: if you're back at work today, is there anything you need to get unblocked on T41785? [19:15:05] T41785: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 [19:15:10] Or is it just dns changes remaining? [19:16:18] hey andrewbogott! not blocked by cloud on anything, the associated patch is up for review and working through that [19:16:32] the aliaser will work fine [19:16:53] it'd be ideal if the permanent non-NAT neutron solution is fixed, though! [19:17:07] while we're working on that, and so as not to have to revisit this in a few weeks :) [19:18:12] fwiw that’s T206261 [19:18:13] T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 [19:18:25] indeed, thanks :) [19:18:27] yeah it strikes me as something to get decided and stable before migrating anything significant there [19:18:56] I must not understand how that's related to the smtp thing. Are we hardcoding IPs someplace for the mx? [19:19:26] smarthost clients will use mx-out01.wmflabs.org in their config, and that should stay private [19:20:19] (03PS11) 10Paladox: Planet: Redesign UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [19:20:21] ok, so… how is T206261 related? [19:20:46] I mean, if we change the routing and turn off the aliaser who will notice? [19:22:06] andrewbogott: the email logs on the smarthost will all appear as coming from 185.15.56.1 [19:22:28] so it would be harder (or impossible?) to identify which VM sends what [19:22:37] which could be important if e.g. we have a spammer [19:22:45] that we need to react to quickly [19:22:48] does that make sense? [19:23:01] so as of https://phabricator.wikimedia.org/T41785#4644690 the traffic does stay private, but T206261 seems a more optimal approach [19:23:36] unless the sending instance has a floating IP of their own, right?
[19:23:40] As I understand it, traffic from outside of the cloud net will show the proper originating IP. And traffic from inside the cloud will show the correct internal IP of the originating host. [19:24:07] it should, but that's not the case right now [19:24:14] (well, it is, with labs-aliaser, I believe) [19:24:28] cf. herron's https://phabricator.wikimedia.org/T206261#4652702 [19:24:35] 18:01:00.956781 IP 185.15.56.1.34330 > 172.16.1.239.25: Flags [S], seq 2492930878, win 29200, options [mss 1460,sackOK,TS val 301736818 ecr 0,nop,wscale 9], length 0 [19:25:07] Krenair: not sure, do hosts with floating IP perform dns lookup differently? [19:25:36] so traffic from a cloud VM to a cloud floating IP would appear as originating from 185.15.56.1 [19:25:40] i.e. both DNATed and SNATed [19:25:49] unsure if I'm making sense :) [19:26:03] herron, no [19:26:15] my point is you should see a 185.15. etc. IP that isn't the generic catch-all one [19:26:33] if they're using the public IP [19:26:34] the example in that paste looks like the test is using the literal IP rather than the hostname, yes? [19:27:05] how do you mean? 
[19:27:51] (03PS1) 10Ottomata: Use net_topology script content rather than erb path [puppet/cdh] - 10https://gerrit.wikimedia.org/r/467766 (https://phabricator.wikimedia.org/T204951) [19:28:05] If I understand correctly, in that test Keith is opening a connection here: [19:28:06] cloudinfra-puppetmaster-01:~# nc -vz 185.15.56.18 25 [19:28:07] (03PS1) 10Gehel: wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 [19:28:10] andrewbogott: yes in the paste the context is routing [19:28:23] (03CR) 10jerkins-bot: [V: 04-1] Use net_topology script content rather than erb path [puppet/cdh] - 10https://gerrit.wikimedia.org/r/467766 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:28:28] so [19:28:33] I'm going to try again [19:28:46] (03CR) 10jerkins-bot: [V: 04-1] wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 (owner: 10Gehel) [19:28:54] 1) Is there any situation where the MX is being addressed directly by IP, or just by hostname? [19:29:06] 2) Is there any situation where, when the MX is addressed by hostname, we get the incorrect originating IP? [19:29:22] My understanding is that the answer to both of those is 'no' because of the ip aliaser. [19:29:22] is the IP vs. hostname because of labs-aliaser? 
[19:29:28] yes [19:29:35] ok, I'm not too familiar with that tbh [19:30:46] (03PS2) 10Gehel: wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 [19:31:28] (03CR) 10Cwhite: "> That covers all the dump roles on the snapshot hosts, but not the" [puppet] - 10https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:32:38] (03PS2) 10Cwhite: prometheus: add collector.ntp.server option and enable on recursor nodes [puppet] - 10https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454) [19:32:43] (03PS3) 10Gehel: wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 [19:33:07] So… I'm going back to thinking that T206261 is unrelated to the mx issue, and is basically a 'this would be cleaner' refactor rather than a blocker for any of my work. [19:33:08] T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 [19:33:38] I suppose so [19:34:05] wouldn't call it a refactor, from arturo's comment it sounds like it may be a single hiera value change, but I'm not too familiar [19:34:13] so I may be wrong [19:35:04] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/12977/" [puppet] - 10https://gerrit.wikimedia.org/r/467767 (owner: 10Gehel) [19:35:24] (03PS1) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [19:35:40] (03PS2) 10Ottomata: Use net_topology script content rather than erb path [puppet/cdh] - 10https://gerrit.wikimedia.org/r/467766 (https://phabricator.wikimedia.org/T204951) [19:35:55] (03PS2) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [19:36:21] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org
is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:36:34] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:37:22] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [19:38:34] (03PS3) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [19:38:50] (03PS4) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [19:39:02] (03CR) 10Anomie: "Although you'll have to wait for the train (or have the patch SWATted)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) (owner: 10MGChecker) [19:39:15] yes paravoid to my understanding, only that hiera setting needs to be tuned [19:39:29] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:39:50] but I would need to dive into the code and do some testing in order to provide correct advice [19:40:35] our dmz_cidr setting should take precedence to my understanding but I may be wrong, I need to read the code [19:40:46] and see the resulting iptables ruleset [19:40:49] (03PS5) 10Ottomata: Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) [19:41:20] (03CR) 10jerkins-bot: [V: 04-1] Move Hadoop net topology to hiera [puppet] - 10https://gerrit.wikimedia.org/r/467769 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [19:41:39] arturo: if there's anything I can do to help I’m happy to [19:41:42] I plan to do that
this week, I couldn't during the offsite [19:42:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.498 second response time [19:43:01] I'm not opposed to that getting fixed, just want to make it clear that it's not a blocker for smtp or migration (since the current behavior is the same as the old region behavior) [19:43:09] my first test will be to simply add the new NAT exclusion and see what happens [19:43:21] (the hiera key) [19:45:42] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:11] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:47:11] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [19:49:11] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.925 second response time [19:49:35] (03CR) 10Cwhite: [C: 032] prometheus: add collector.ntp.server option and enable on recursor nodes [puppet] - 10https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:52:25] (03CR) 10ArielGlenn: [C: 031] "Feel free to remove it also from the dumpsdata hosts at will; maybe cloud folks will want to be asked about labstore1006/7, as those serve" [puppet] - 10https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:52:31] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:02] PROBLEM - HP RAID on db2051 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK [19:54:04] ACKNOWLEDGEMENT - HP 
RAID on db2051 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207212 [19:54:08] 10Operations, 10ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10ops-monitoring-bot) [19:56:32] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10ArielGlenn) Whoever has clinic duty should probably take this and hand it to the right person. I think @Dzahn and/or @MoritzMuehlenhoff may oversee the ubuntu mirror boxes (if not, please excuse the ping). [20:07:23] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) Is the current instance on wmflabs.org puppetized? If so, where to see the related puppet files? [20:08:45] (03CR) 10Smalyshev: [C: 031] wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 (owner: 10Gehel) [20:09:36] (03PS4) 10Gehel: wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 [20:10:50] (03CR) 10Gehel: [C: 032] wdqs: improve resource ordering for non scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/467767 (owner: 10Gehel) [20:13:24] (03PS10) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:14:10] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:14:13] (03PS3) 10BBlack: Add sitemaps rewrite for additional domains [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 
10Imarlier) [20:14:15] (03PS8) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [20:14:24] (03CR) 10BBlack: [C: 032] Add sitemaps rewrite for additional domains [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 10Imarlier) [20:14:48] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [20:14:50] !log restarted pdfrender on scb1003 [20:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:06] (03CR) 10Herron: smarthost: create mail smarthost role/profile (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:27:46] (03PS1) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) [20:28:16] (03CR) 10jerkins-bot: [V: 04-1] Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:28:28] !log twentyafterfour@deploy1001 Started scap: Syncing 1.32.0-wmf.26 refs T191072 [20:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [20:28:56] (03PS2) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) [20:29:28] (03CR) 10jerkins-bot: [V: 04-1] Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:30:02] (03PS1) 10Dzahn: nagios_common: convert check_jnx_alarmts to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467816 
(https://phabricator.wikimedia.org/T202782) [20:31:05] (03PS3) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) [20:31:35] (03CR) 10jerkins-bot: [V: 04-1] Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:32:08] (03PS11) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:32:53] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [20:33:00] (03PS4) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) [20:34:49] 10Operations, 10ops-esams, 10Traffic, 10hardware-requests: Procure and install LVS and miscellaneous servers - https://phabricator.wikimedia.org/T184068 (10RobH) [20:34:58] (03CR) 10Hashar: "The logic in rake_modules does not detect it should run wdqs. You can run it locally with:" [puppet] - 10https://gerrit.wikimedia.org/r/467692 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [20:36:09] (03CR) 10BBlack: [C: 031] "LGTM, sorry for all the back and forth! 
Compiler likes it too, and the catalog data from that confirms numa_networking is turned on corre" [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) (owner: 10Gehel) [20:37:40] (03PS5) 10Ottomata: Move Hive profile settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) [20:39:27] (03CR) 10Ottomata: "No-ops all around: https://puppet-compiler.wmflabs.org/compiler1002/12980/" [puppet] - 10https://gerrit.wikimedia.org/r/467815 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [20:44:58] (03PS3) 10BBlack: authdns: add interface::rps and TFO [puppet] - 10https://gerrit.wikimedia.org/r/464861 [20:47:05] (03CR) 10BBlack: [C: 032] "tg3 support fixed up in https://gerrit.wikimedia.org/r/c/operations/puppet/+/467443 . None of our current authdns hosts are NUMA (they ha" [puppet] - 10https://gerrit.wikimedia.org/r/464861 (owner: 10BBlack) [20:48:23] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10thcipriani) Seems like this is something we co... 
[20:55:00] !log twentyafterfour@deploy1001 Finished scap: Syncing 1.32.0-wmf.26 refs T191072 (duration: 26m 32s) [20:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:04] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [21:08:22] (03PS1) 1020after4: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467819 [21:08:24] (03CR) 1020after4: [C: 032] group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467819 (owner: 1020after4) [21:09:36] (03Merged) 10jenkins-bot: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467819 (owner: 1020after4) [21:11:27] (03PS1) 10Bstorm: wiki replicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/467820 (https://phabricator.wikimedia.org/T181650) [21:13:21] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Volans) [21:15:33] (03CR) 10jenkins-bot: group0 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467819 (owner: 1020after4) [21:15:55] (03PS5) 10Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - 10https://gerrit.wikimedia.org/r/466990 [21:16:14] (03PS1) 10Ottomata: Label Hadoop prometheus metrics with the hadoop_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/467821 (https://phabricator.wikimedia.org/T204951) [21:18:09] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.26 refs T191072 [21:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:13] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [21:28:45] 10Operations, 10Wikimedia-General-or-Unknown, 10Performance: arwiki page giving "entire web request took longer than 60 seconds 
and timed out" - https://phabricator.wikimedia.org/T206878 (mmodell)
[21:29:03] Krinkle: are you deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/467818/ soon ?
[21:29:14] (PS1) Thifranc: Remove 'double quoted string' errors from puppet-lint [puppet] - https://gerrit.wikimedia.org/r/467824
[21:29:16] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/467824 (owner: Thifranc)
[21:29:20] Operations, Wikimedia-General-or-Unknown, Performance, Wikimedia-production-error: arwiki page giving "entire web request took longer than 60 seconds and timed out" - https://phabricator.wikimedia.org/T206878 (mmodell)
[21:29:25] AaronSchulz: yeah, planning to as soon as I'm done triaging the new branch's impact.
[21:30:50] \o/
[21:33:16] (CR) Paladox: [C: -1] Remove 'double quoted string' errors from puppet-lint (15 comments) [puppet] - https://gerrit.wikimedia.org/r/467824 (owner: Thifranc)
[21:42:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required
[21:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:34] (PS1) Thifranc: Correct alongside Paladox review [puppet] - https://gerrit.wikimedia.org/r/467826
[21:43:07] thifranc i think you pushed wrong :)
[21:43:30] sry for trouble, first gerrit use
[21:43:47] thifranc yep :), should become easier in the future.
[21:43:54] you can also use the inline editor.
[21:45:03] twentyafterfour: can we roll back? filing the report, but see at least one new fatal that probably affects most edits due to a bad type hint in AbuseFilter that needs fixing, that was introduced a few days ago
[21:45:24] actually, is it test wikis only, or group0 fully.
[21:45:33] nevermind if test only. Can fix as we go.
[21:45:49] Krinkle <+logmsgbot> !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.26 refs T191072
[21:45:49] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072
[21:45:56] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required (duration: 03m 53s)
[21:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:41] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 2
[21:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:10] paladox: I don't see what's pushed wrong, I can see my changes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/467826 isn't that what's expected?
[21:51:00] thifranc you need to cherry-pick https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467824/
[21:51:08] then make your changes, then git commit -a --amend
[21:51:17] then you git push
[21:52:39] thifranc: unlike github in which you're expected to push new commits to resolve issues, in Gerrit you amend the existing commit with fixes
[21:54:33] Krinkle: we can roll back
[21:55:52] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 2 (duration: 09m 11s)
[21:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:56] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 3
[21:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:11] (PS1) 20after4: group0 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/467828
[21:56:13] (CR) 20after4: [C: 2] group0 wikis to 1.32.0-wmf.24 refs T191072
[mediawiki-config] - https://gerrit.wikimedia.org/r/467828 (owner: 20after4)
[21:57:18] (Merged) jenkins-bot: group0 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/467828 (owner: 20after4)
[21:58:52] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.32.0-wmf.24 refs T191072
[21:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:55] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072
[22:00:11] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 3 (duration: 04m 15s)
[22:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:17] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 4
[22:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:53] Operations, DBA, MediaWiki-Cache, Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle)
[22:02:53] (CR) jenkins-bot: group0 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/467828 (owner: 20after4)
[22:04:10] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 4 (duration: 03m 53s)
[22:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:15] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 5
[22:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:31] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to
minor-greater if no-cache is required, take 5 (duration: 05m 16s)
[22:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:43] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 6
[22:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:00] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d9e3a09]: Downgrade major-greater to minor-greater if no-cache is required, take 6 (duration: 01m 18s)
[22:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:30] (PS1) Legoktm: Add REL1_32 to ExtensionDistributor [mediawiki-config] - https://gerrit.wikimedia.org/r/467830
[22:18:54] (PS1) Dzahn: nagios_common: convert check_sslxNN to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467831 (https://phabricator.wikimedia.org/T202782)
[22:39:49] (PS2) Dzahn: nagios_common: convert check_jnx_alarms to use Monitoring::Plugin [puppet] - https://gerrit.wikimedia.org/r/467816 (https://phabricator.wikimedia.org/T202782)
[22:43:02] (PS2) Thifranc: Remove 'double quoted string' errors from puppet-lint [puppet] - https://gerrit.wikimedia.org/r/467824
[22:44:46] thifranc: amend worked it looks :)
[22:45:32] still a little lost, next contributions will be smoother I promise !
[22:46:51] thifranc: no worries at all, glad it works
[22:52:17] (PS12) Paladox: Planet: Redesign UI [puppet] - https://gerrit.wikimedia.org/r/467100
[22:58:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181016T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:02:45] (CR) Thifranc: "Add Giuseppe Lavagetto as reviewer he was the last person to edit those files" [puppet] - https://gerrit.wikimedia.org/r/467824 (owner: Thifranc)
[23:05:57] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:12:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:17:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:19:19] Operations, Wikimedia-Mailing-lists: Make aklapper a co-admin of the list-admins@ mailing list - https://phabricator.wikimedia.org/T207239 (Aklapper)
[23:21:05] (PS13) Paladox: Planet: Redesign UI [puppet] - https://gerrit.wikimedia.org/r/467100
[23:30:38] (PS1) Dduvall: Project clone URLs based on access control [software/gerrit/plugins/go-import] (stable-2.15) - https://gerrit.wikimedia.org/r/467843
[23:37:14] * Krinkle staging on mwdebug1001
[23:41:08] (PS14) Paladox: Planet: Redesign UI [puppet] - https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243)
[23:45:40] * Krinkle not staging