[00:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T0000).
[00:00:05] kaldari and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:08] \o
[00:01:31] i suppose i can ship things
[00:07:30] (PS2) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[00:07:36] Operations, monitoring, Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3904801 (ayounsi) a:ayounsi>None
[00:08:39] Operations, monitoring, Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3890758 (ayounsi) About: > Postgres DB was empty after I was able to have it restarted. No tables defined, but a netbox DB is defined. I re-ran the scap script...
[00:08:48] !log ebernhardson@tin Synchronized php-1.31.0-wmf.17/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T182616 Turn off cirrus AB test on hewiki (duration: 01m 14s)
[00:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:03] T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616
[00:10:18] !log ebernhardson@tin Synchronized php-1.31.0-wmf.16/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T182616 Turn off cirrus AB test on hewiki (duration: 01m 12s)
[00:10:25] kaldari: here for swat?
[00:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:55] here
[00:14:58] o/
[00:15:22] (PS3) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[00:15:32] (CR) EBernhardson: [C: 2] Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:15:35] kaldari: anything to test?
[00:15:41] * ebernhardson has no clue what this even does :P
[00:17:21] solid +2 :)
[00:17:54] (Merged) jenkins-bot: Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:18:04] (CR) jenkins-bot: Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:18:11] ebernhardson: nope
[00:21:37] !log ebernhardson@tin Synchronized fc-list: SWAT: T184664 Updating fonts list and sorting it (duration: 01m 12s)
[00:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:51] T184664: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664
[00:22:01] kaldari: deployed
[00:22:34] (Merged) jenkins-bot: Remove cirrus AB test config for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404592 (https://phabricator.wikimedia.org/T182616) (owner: EBernhardson)
[00:22:47] (CR) jenkins-bot: Remove cirrus AB test config for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404592 (https://phabricator.wikimedia.org/T182616) (owner: EBernhardson)
[00:23:02] isnt the test to render some specific SVG?:)
[00:23:29] oh, you already did, saw ticket comment, cool
[00:25:09] turns out .. i didn't rebase before shipping that one :P one more time
[00:26:11] !log ebernhardson@tin Synchronized fc-list: SWAT: T184664 Updating fonts list and sorting it (duration: 01m 12s)
[00:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:51] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T182616 Remove cirrus AB test config for hewiki (duration: 01m 09s)
[00:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:05] T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616
[00:29:07] swat complete
[00:45:59] (PS4) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[01:03:34] (CR) Dzahn: [C: 2] DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215) (owner: Dzahn)
[01:17:16] (CR) Ayounsi: [C: 2] Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:17:22] (PS3) Ayounsi: Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:17:46] (CR) Ayounsi: [V: 2 C: 2] Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:24:08] Operations, Patch-For-Review: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3905122 (Dzahn) Open>Resolved both things done , in separate changes, and mailed ops list about it.
[01:27:36] (CR) Dzahn: "better add traffic team, i'm not a good reviewer for this" [puppet] - https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: Alex Monk)
[01:42:14] Operations, Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3905129 (MNeisler) I've confirmed with the team that I'll need access to the `analytics-privatedata-users` and `researchers` groups. I'll specifically need access to...
[01:58:31] (PS10) Dzahn: Switch to YAML configuration for Parsoid on ruthenium [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[01:58:39] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/9753/ruthenium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[01:59:41] (PS11) Dzahn: parsoid::testing:: Switch to YAML configuration [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[02:02:42] (PS1) MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337)
[02:03:42] (CR) Dzahn: "applied on ruthenium now" [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[02:31:06] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.16) (duration: 07m 11s)
[02:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:37] Operations, Research, Patch-For-Review, Research-2017-18-Q2: Permissions to upload data to the analytics cluster from a machine at Drexel - https://phabricator.wikimedia.org/T177521#3905172 (DarTar) Open>Resolved
[03:21:10] (PS47) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[03:37:46] (PS1) TerraCodes: Move flaggedrevs to NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603)
[04:26:52] (Draft2) Jayprakash12345: Add Draft Namespace in enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957)
[04:27:24] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:28:45] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:39:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:40:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:52:54] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.24 seconds
[04:53:34] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.59 seconds
[04:53:35] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.93 seconds
[04:53:35] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.88 seconds
[04:53:35] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.52 seconds
[04:53:35] PROBLEM - MariaDB Slave Lag: s3 on db2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.54 seconds
[04:53:44] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.79 seconds
[05:00:05] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:02:35] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 57.13 seconds
[05:02:35] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 56.24 seconds
[05:02:35] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 53.40 seconds
[05:02:35] RECOVERY - MariaDB Slave Lag: s3 on db2018 is OK: OK slave_sql_lag Replication lag: 48.91 seconds
[05:02:35] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 47.94 seconds
[05:02:44] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 33.53 seconds
[05:02:54] RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 18.45 seconds
[06:14:14] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569)
[06:16:52] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:18:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:18:28] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:20:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 - T174569 (duration: 01m 14s)
[06:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:57] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:21:07] !log Upgrade mariadb and kernel on db1104
[06:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:02] !log Deploy schema change on db1104 - T174569
[06:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:15] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:29:31] !log Stop replication in sync on db1089 and s1 codfw master (db2048) - T162807
[06:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:42] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:30:04] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:40:03] Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905339 (Marostegui)
[06:40:21] !log Stop MySQL on labsdb1001 (already dead) and labsdb1003 - T184832
[06:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:32] T184832: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832
[06:40:55] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:42:09] (PS4) Marostegui: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[06:42:34] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:42:54] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:43:28] Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905342 (Marostegui)
[06:43:32] (CR) Marostegui: [C: 2] mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[06:46:38] Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905344 (Marostegui)
[06:47:18] !log Remove labsdb1001 and labsdb1003 from tendril - T184832
[06:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:30] T184832: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832
[06:49:47] Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905347 (Marostegui) a:Cmjohnson I believe this is now ready for @Cmjohnson to proceed.
[07:45:48] <_joe_> !log depooling mw1209-1220 from the appserver cluster for decommissioning, T185004
[07:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:02] T185004: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004
[07:46:33] !log oblivian@neodymium conftool action : set/pooled=inactive; selector: cluster=appserver,name=mw12([0-1][0-9]|20)\.eqiad\.wmnet
[07:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:31] !log restart varnish backend on cp4024 (ton of 503s, icinga alerting for mailbox lag)
[07:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:13] (CR) Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row (6 comments) [puppet] - https://gerrit.wikimedia.org/r/404498 (owner: Giuseppe Lavagetto)
[07:54:29] (CR) Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[07:55:07] PROBLEM - mediawiki-installation DSH group on mw1215 is CRITICAL: Host mw1215 is not in mediawiki-installation dsh group
[07:56:12] (PS3) Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453
[07:56:14] (PS2) Giuseppe Lavagetto: site.pp: decommission mw1201-1208 [puppet] - https://gerrit.wikimedia.org/r/404499 (https://phabricator.wikimedia.org/T185004)
[07:56:16] (PS2) Giuseppe Lavagetto: site.pp: decommission mw1209-1220 [puppet] - https://gerrit.wikimedia.org/r/404500 (https://phabricator.wikimedia.org/T185004)
[07:56:18] (PS2) Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row [puppet] - https://gerrit.wikimedia.org/r/404498
[07:58:51] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
[07:59:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:59:50] (CR) Elukey: [C: 1] site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[08:00:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:03:32] PROBLEM - mediawiki-installation DSH group on mw1209 is CRITICAL: Host mw1209 is not in mediawiki-installation dsh group
[08:04:12] PROBLEM - mediawiki-installation DSH group on mw1216 is CRITICAL: Host mw1216 is not in mediawiki-installation dsh group
[08:05:52] (CR) Giuseppe Lavagetto: [C: 2] site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[08:11:31] PROBLEM - mediawiki-installation DSH group on mw1220 is CRITICAL: Host mw1220 is not in mediawiki-installation dsh group
[08:12:12] PROBLEM - mediawiki-installation DSH group on mw1210 is CRITICAL: Host mw1210 is not in mediawiki-installation dsh group
[08:12:12] PROBLEM - mediawiki-installation DSH group on mw1219 is CRITICAL: Host mw1219 is not in mediawiki-installation dsh group
[08:13:37] these are expected, part of decom --^
[08:15:08] Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3905409 (Joe)
[08:18:53] <_joe_> yeah but I downtimed the hosts...
[08:19:49] (CR) Giuseppe Lavagetto: [C: 2] site.pp: decommission mw1201-1208 [puppet] - https://gerrit.wikimedia.org/r/404499 (https://phabricator.wikimedia.org/T185004) (owner: Giuseppe Lavagetto)
[08:30:11] (CR) Giuseppe Lavagetto: [C: 2] site.pp: decommission mw1209-1220 [puppet] - https://gerrit.wikimedia.org/r/404500 (https://phabricator.wikimedia.org/T185004) (owner: Giuseppe Lavagetto)
[08:36:07] (PS1) Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569)
[08:37:47] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:10] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:20] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:28] Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3905423 (Joe)
[08:40:42] Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902917 (Joe) a:Joe>None
[08:42:30] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=61%)
[08:42:35] (PS1) Marostegui: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807)
[08:44:14] !log reboot stat100[456] for kernel upgrades
[08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1101:3318 (duration: 15m 42s)
[08:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:58:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:58:53] (CR) jenkins-bot: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:00:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 01m 11s)
[09:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:43] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:03:35] (PS1) Filippo Giunchedi: restbase: reprovision restbase1016 [puppet] - https://gerrit.wikimedia.org/r/404638 (https://phabricator.wikimedia.org/T184100)
[09:04:34] (CR) Filippo Giunchedi: [C: 2] restbase: reprovision restbase1016 [puppet] - https://gerrit.wikimedia.org/r/404638 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:06:05] !log reboot analytics1003 for kernel upgrades
[09:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:57] (PS2) Gehel: wdqs: simplify logging of categories reload [puppet] - https://gerrit.wikimedia.org/r/404315
[09:11:27] Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3905487 (Qgil)
[09:11:31] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639
[09:11:39] (CR) Alexandros Kosiaris: [C: 1] Postgres: remove hardcoded version [puppet] - https://gerrit.wikimedia.org/r/404516 (https://phabricator.wikimedia.org/T184634) (owner: Ayounsi)
[09:12:29] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[init_superset]
[09:13:54] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:14:36] !log reimage restbase1016 - T184100
[09:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:47] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:15:53] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:16:51] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:17:29] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:17:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 (duration: 01m 12s)
[09:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:29] PROBLEM - DPKG on furud is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:25:16] thorium should be fixed
[09:25:28] RECOVERY - DPKG on furud is OK: All packages OK
[09:28:45] (PS1) Marostegui: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642
[09:30:29] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:30:34] !log rebooting flerovium and furud for kernel security update
[09:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:05] Operations, MediaWiki-Configuration, discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3905518 (Joe) p:Triage>Normal
[09:32:01] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:32:13] (CR) jenkins-bot: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:34:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Full repool db1101:3318 (duration: 01m 11s)
[09:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:20] Operations, ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3905531 (elukey) Open>Resolved a:elukey Host decommed in https://phabricator.wikimedia.org/T183895
[09:38:16] Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3905536 (fgiunchedi) I'm running into this issue again when reimaging restbase systems as part of {T184100}. From the comments above it seems that setting rootdelay= fixes the issue but we're not applyin...
[09:46:04] !log removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
[09:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:21] Cc: Krinkle,ottomata --^
[09:52:13] !log reboot druid1005 for kernel upgrades
[09:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:40] uff it is druid1002, amending
[09:59:53] Operations, MediaWiki-Configuration, discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3905554 (Volans)
[10:00:41] Operations, MediaWiki-Configuration, discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3905561 (Volans)
[10:01:05] !log start cassandra-a on restbase1016
[10:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:01] Operations, MediaWiki-Configuration, discovery-system: Prepare conftool for safely editing mediawiki-config values - https://phabricator.wikimedia.org/T185080#3905564 (Joe) p:Triage>Normal
[10:11:54] !log reset RAC on hydrogen, serial console was inaccessible
[10:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:28] !log depooling hydrogen (and keeping pdns-recursor stopped for a few minutes to check whether problems with load-balanced recdns traffic are still an issue)
[10:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:43] (PS1) Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/404647
[10:19:05] Operations, DBA, MediaWiki-Configuration, discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (Joe) p:Triage>Normal
[10:19:50] Operations, DBA, MediaWiki-Configuration, discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (Joe)
[10:22:51] !log repooling hydrogen (and pdns-recursor restarted), experiment concluded
[10:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:11] (PS1) Muehlenhoff: Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648
[10:26:55] (CR) Ema: [C: 1] Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648 (owner: Muehlenhoff)
[10:27:20] (CR) Muehlenhoff: [C: 2] Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648 (owner: Muehlenhoff)
[10:34:57] (PS16) Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647)
[10:35:00] !log depooling hydrogen again
[10:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:17] !log rebooting hydrogen for kernel security update
[10:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:49] !log repooling hydrogen
[10:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:12] (PS1) Ema: Revert "Remove hydrogen from LVS name server config" [puppet] - https://gerrit.wikimedia.org/r/404649
[10:44:08] (PS1) Muehlenhoff: Revert "Remove hydrogen from LVS name server config" [puppet] - https://gerrit.wikimedia.org/r/404650
[10:44:21] (CR) Muehlenhoff: [C: 1] Revert "Remove hydrogen from LVS name server config" [puppet] - https://gerrit.wikimedia.org/r/404649 (owner: Ema)
[10:44:36] (PS2) Alexandros Kosiaris: grafana: Enable grafana's LDAP [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[10:44:38] (PS1) Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[10:44:42] (CR) Ema: [C: 2] Revert "Remove hydrogen from LVS name server config" [puppet] - https://gerrit.wikimedia.org/r/404649 (owner: Ema)
[10:44:44] (Abandoned) Muehlenhoff: Revert "Remove hydrogen from LVS name server config" [puppet] - https://gerrit.wikimedia.org/r/404650 (owner: Muehlenhoff)
[10:45:35] (CR) jerkins-bot: [V: -1] grafana: Enable grafana's LDAP [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[10:51:04] !log reset RAC on chromium, serial console is inaccessible
[10:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:13] (PS3) Alexandros Kosiaris: grafana: Enable grafana's LDAP [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[10:56:45] Operations, Traffic, media-storage, User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3905720 (fgiunchedi)
[11:03:24] (PS2) Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/404647
[11:07:05] (CR) Giuseppe Lavagetto: [C: -1] "Looks correct puppet-wise, but I dislike the fact we're basically writing the grafana configurations in hiera directly and there is no val" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:07:57] (PS3) Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/404647
[11:09:01] (CR) Giuseppe Lavagetto: [C: 1] Simplify profile::grafana::production [puppet] - https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:09:49] (CR) Volans: "LGTM if this is a one-off script. But in this case do we really need it added into Puppet?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:10:25] (CR) Giuseppe Lavagetto: [C: 1] Move role::grafana::base to profile::grafana [puppet] - https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:11:09] (CR) Giuseppe Lavagetto: [C: 1] Remove role::grafana::labs [puppet] - https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:11:46] (CR) Giuseppe Lavagetto: [C: 1] grafana: Allow to modify the config in hiera [puppet] - https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[11:18:42] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:19:00] <_joe_> !log restarted nginx on mw1346, was in a bad state
[11:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:31] !log rebooting neodymium for kernel security update
[11:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:30] !log rearmed keyholder on neodymium
[11:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:40] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2720 MB (3% inode=97%): /srv 429831 MB (6% inode=93%)
[11:37:09] yeah we are aware of this --^
[11:37:27] going to work on it this afternoon
[11:40:19] !log bootstrap cassandra-b on restbase1016
[11:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:34] (PS1) Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100)
[11:48:32] (PS1) Aklapper: Allow discourse-mediawiki.wmflabs.org RSS feed on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087)
[11:51:33] (PS4) Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[11:56:47] (CR) Alexandros Kosiaris: "It is indeed a once-off. I 've uploaded it in puppet, mostly to get reviews for it (thanks btw!) and have people aware of what is going to" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[12:00:22] (CR) Alexandros Kosiaris: "Yes, I dislike that too, but it's the status quo for this module and my current aim is not to challenge it, but rather do T170150." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[12:07:25] (PS2) Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[12:07:27] (PS5) Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[12:10:27] !log updating HHVM in deployment-prep to 3.18.5+wmf4
[12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:39] Operations, DBA, Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3905851 (jcrespo)
[12:16:42] Operations, DBA, Goal, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3905852 (jcrespo)
[12:20:27] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[12:28:18] !log uploading HHVM 3.18.5+wmf4 for jessie-wikimedia to apt.wikimedia.org (3.18.7 with the patch https://github.com/facebook/hhvm/commit/bd7b2bcfe70b053a3a001480653012f68599250f backed out)
[12:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:51] (CR) Mark Bergsma: "A few minor nitpicks, otherwise good to go."
(032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [12:35:57] (03PS4) 10Mark Bergsma: Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) [12:35:59] (03PS8) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764) [12:38:00] (03CR) 10Mark Bergsma: [C: 032] Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) (owner: 10Mark Bergsma) [12:38:26] (03Merged) 10jenkins-bot: Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) (owner: 10Mark Bergsma) [12:48:52] (03CR) 10Faidon Liambotis: [C: 031] "I still find passing verbose around to functions intended to upgrade packages etc. to be a bit icky, but OK, +1 as far as I'm concerned :)" [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:53:30] (03PS6) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) [12:58:34] (03PS4) 10Faidon Liambotis: Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki) [12:59:13] (03CR) 10Faidon Liambotis: [C: 032] Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki) [13:02:15] 10Operations, 10Patch-For-Review: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo - https://phabricator.wikimedia.org/T184338#3905931 (10faidon) 05Open>03Resolved Merged -- thanks :) [13:09:21] 10Operations, 
10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3905953 (10hashar) [13:12:57] !log Fixing drifts on db1065 - T162807 [13:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:08] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [13:15:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) [13:17:11] !log upgrading app server canaries to 3.18.5+wmf4 [13:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:20:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:20:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:20:54] (03PS4) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647 [13:21:38] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3905976 (10chasemp) [13:22:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 - T174569 (duration: 01m 12s) [13:22:22] !log Deploy schema change on db1099:3318 - https://phabricator.wikimedia.org/T174569 [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:25] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:51] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) [13:26:49] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906000 (10chasemp) [13:31:36] !log reboot acrab for PCID,INVPCID enabling [13:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:25] 10Operations, 10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3906004 (10MoritzMuehlenhoff) I've built/uploaded new HHVM packages for jessie (stretch following soon) which disable the broken patc... [13:32:49] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 [13:32:50] PROBLEM - Host acrab is DOWN: PING CRITICAL - Packet loss = 100% [13:33:51] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [13:34:10] RECOVERY - Host acrab is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms [13:34:45] 10Operations, 10cloud-services-team: Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#3906007 (10chasemp) p:05Triage>03High [13:35:24] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906023 (10chasemp) [13:36:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui) [13:37:57] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui) [13:38:07] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui) [13:39:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 (duration: 01m 13s) [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:35] !log labstore2001:~# /sbin/reboot [13:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] !log labstore2002:~# sudo update-grub && /sbin/reboot [13:45:59] !log reboot sca2003 webperf2001 planet2001 poolcounter2002 mx2001 kubetcd200{1,2,3} install2002 dbmonitor2001 alsafi acrux hassaleh diadem nihal pybal-test200{1,2,3} releases2001 tureis for PCID, INVPCID [13:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:19] (03CR) 10Jcrespo: [C: 04-1] "After looking at it, this is wrong- mariadb maintenace must be kept, including currently no tasks (we will include a checksum soon). 
the t" [puppet] - 10https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: 10Dzahn) [13:50:44] (03PS1) 10Ema: eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672 [13:51:26] !log reboot labstore2003 [13:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:51] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:00] PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:02] elukey: Thx for the notif [13:52:11] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:20] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:30] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:40] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:41] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:51] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:01] PROBLEM - puppet last run on kubetcd2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:11] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:53:11] PROBLEM - etc request latencies on acrux is CRITICAL: CRITICAL - etcd_request_latencies is 68357 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:53:17] akosiaris: did you named puppetdb today too? [13:53:44] volans: yes. I rebooted nihal for the kernel upgrades [13:53:51] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:51] PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:54:11] RECOVERY - etc request latencies on acrux is OK: OK - etcd_request_latencies is 2536 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:54:11] ack [13:54:13] the more interesting thing (which is something I wanted to test) is the etc request latencies on acrux thing [13:54:37] I force rebooted the etcd cluster without waiting much, on purpose [13:54:59] (03CR) 10Muehlenhoff: [C: 031] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672 (owner: 10Ema) [13:55:09] and yes it has recovered [13:55:16] but I like it has alerted [13:55:18] that's nice [13:55:20] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:21] PROBLEM - puppet last run on elastic2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:21] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:21] yeah [13:55:21] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:30] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:55:30] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:41] PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:56:00] PROBLEM - puppet last run on db2084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:56:19] cluster is healthy [13:56:20] nice [13:56:20] PROBLEM - puppet last run on mc2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:56:20] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:57:01] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:57:40] akosiaris: ignorant question - why does the oom killer act every time on nitrogen even if there are (potentially) pages from the page-cache to reclaim? I can't see anything weird from the puppetdb, except the puppetdb's jvm crossing a certain threshold of committed memory (but not thrashing afaics, or better, not changing its GC behavior much) [13:58:42] (03CR) 10Rush: [C: 031] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:58:54] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:59:08] elukey: that's exactly my question.
I am waiting to see what will happen next time [13:59:37] I can't understand how suddenly the VM is at top vm memory usage and OOM shows up [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1400). [14:00:04] Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] I can SWAT today [14:00:22] i am here [14:00:44] RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:01:04] akosiaris, elukey: out of the last 9 OOM killer events in dmesg, 8 of them coincide with the 26,56 minutes of the hour when puppet runs at the same time in both Icinga hosts [14:01:14] can we try to force tegmen at a different time for some days [14:01:14] Jayprakash12345: I will let you know when the first patch is at mwdebug, in a few minutes [14:01:19] and see if we still repro? [14:01:25] volans: sure let's do it [14:01:36] is nitrogen a VM on ganeti? [14:01:44] zeljkof: ok :) [14:01:50] elukey: yes [14:01:57] ahhh nice didn't know it [14:02:00] volans: a nice finding [14:02:11] so yeah that might trigger it [14:02:40] it might just be an indicator of overload, and actually it was j.oe who said to have a look at icinga, which is heavy on the puppetdb [14:02:47] and I found that they have the same crontab :( [14:03:04] and given it is on a VM...
maybe the new kernel doesn't help [14:03:06] (03CR) 10Ema: [C: 032] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672 (owner: 10Ema) [14:03:14] this only happens though when the jvm reaches the ~6g committed memory [14:03:16] and we "overload" easier now [14:03:17] not before [14:03:37] !log depooling chromium [14:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:19] volans: what do you mean with "overload" in this case? Memory pressure ? [14:04:27] !log restart of elasticsearch / cirrus eqiad completed (cluster still recovering) [14:04:29] elukey: we should monitor postgres memory too at the same times [14:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:59] volans: yeah this is another suspicion that I have - 6g for puppetdb + a spike for postgress == oom acting [14:05:18] but maybe it could work to simply tell it to drop page cache a bit [14:05:33] rather than being so harsh :D [14:05:39] eheheh [14:05:54] vm.swappiness = 0 too [14:06:20] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:06:21] (03PS17) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [14:06:23] ah nice finding [14:06:24] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:06:33] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345) [14:06:44] we have 1g of 
swap in there [14:06:57] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3906052 (10akosiaris) Scheduling this for February 12th 2018, say 10:00 am UTC. I 'll run a few more tests and then send an informational m... [14:07:04] I was reading the other day https://chrisdown.name/2018/01/02/in-defence-of-swap.html [14:07:40] and it reminded me that vm.swappiness = 0 triggers some specific behaviour of the kernel regarding memory reclaim [14:07:44] !log rebooting chromium for kernel security update [14:07:48] yeah me too... note that he has no numbers however [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:06] (03Merged) 10jenkins-bot: Add Draft Namespace in enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345) [14:08:09] akosiaris: no numbers? [14:08:31] (03CR) 10jenkins-bot: Add Draft Namespace in enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345) [14:08:33] volans: yeah the entire post is kind of academic [14:08:51] ah yeah, no benchmarks, indeed, and it didn't convince me completely [14:08:51] (03PS3) 10Zfilipin: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345) [14:09:16] neither was I convinced entirely. 
I did keep a mental note to revisit the issue at some point [14:09:24] PROBLEM - DPKG on es2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:10:11] but I thought that in some cases it might be worth a test either with swappiness = 1 or with the cgroup stuff [14:10:20] just speculation for now ;) [14:10:24] RECOVERY - DPKG on es2001 is OK: All packages OK [14:10:25] agreed [14:10:41] volans: how can we change the puppet run times for the icinga hosts? [14:10:58] I don't think you really can [14:10:58] dunno, I asked you and you told me that we could force it :D [14:11:10] I know it's the hash of the host [14:11:19] so not sure if in hiera we can override [14:11:21] didn't check the code [14:11:27] (03PS1) 10Ema: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674 [14:12:00] zeljkof: [config] 404624 Add Draft Namespace in enwikiversity (T184957) working good. [14:12:01] T184957: en:wikiversity Draft Namespace - https://phabricator.wikimedia.org/T184957 [14:12:01] Jayprakash12345: 404624 is at mwdebug1002, please test and let me know if I can deploy [14:12:12] Jayprakash12345: ok to deploy? [14:12:13] zeljkof: deploy [14:12:21] ok, deploying... [14:12:21] elukey: it's in base:puppet [14:12:25] we could also just disable puppet on tegmen for a couple of days, so far it happened every 1~2 days [14:12:48] elukey: but $crontime is calculated in the puppet.cron.erb file [14:13:12] well, the $times variable is, not $crontime [14:13:19] (03PS2) 10Ema: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674 [14:13:27] (03CR) 10Ema: [V: 032 C: 032] Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674 (owner: 10Ema) [14:13:55] $crontime = fqdn_rand(60, 'puppet-params-crontime') [14:14:17] heh..
not very configurable [14:14:26] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404624|Add Draft Namespace in enwikiversity (T184957)]] (duration: 01m 12s) [14:14:29] !log repooling chromium [14:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:38] Jayprakash12345: deployed, please check [14:14:41] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906059 (10chasemp) [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:24] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:16:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345) [14:17:27] (03Merged) 10jenkins-bot: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345) [14:17:37] (03CR) 10jenkins-bot: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345) [14:18:30] Jayprakash12345: 404327 is at mwdebug1002, please test and let me know if I can deploy [14:18:37] zeljkof: ok [14:18:54] (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100) [14:18:56] (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) [14:20:23] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:20:24] RECOVERY - puppet last run on elastic2029 is 
OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:20:24] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:20:33] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:20:55] RECOVERY - puppet last run on db2084 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:21:23] RECOVERY - puppet last run on mc2033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:23] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:21:53] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:03] RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:03] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:22:13] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:23] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:22:25] zeljkof: ok, deploy [14:22:32] Jayprakash12345: deploying... 
[14:22:33] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:43] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:22:43] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:22:53] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:23:04] RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:23:14] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:23:47] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404327|Create "eliminator" user group on ur.wikipedia (T184607)]] (duration: 01m 12s) [14:23:53] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:23:54] RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] T184607: Create "eliminator" user group on ur.wikipedia - https://phabricator.wikimedia.org/T184607 [14:24:13] Jayprakash12345: deployed, please check and thanks for deploying with #releng ;) [14:25:33] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:25:46] zeljkof: Checked, Thanks for being here. 
[14:27:05] 10Operations, 10cloud-services-team: labstore2003 reboots into mode missing /srv disks - https://phabricator.wikimedia.org/T185102#3906073 (10chasemp) p:05Triage>03High [14:27:13] !log EU SWAT finished [14:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:09] (03PS5) 10Ema: Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) [14:29:40] (03CR) 10Ema: Use up-and-enabled servers in can-depool logic (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [14:29:55] (03CR) 10Volans: "Much better, thanks for improving it. Still some fixes needed, see inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:30:53] 10Operations: Stack overflow on beta cluster API interaction - https://phabricator.wikimedia.org/T185103#3906085 (10Niedzielski) [14:38:29] !log labstore1001:~# /sbin/reboot [14:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:58] (03CR) 10Ema: [C: 032] Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [14:42:42] (03CR) 10Volans: "A couple of questions inline" (033 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [14:43:38] (03CR) 10Arlolra: "Thanks. I ran update_parsoid.sh on ruthenium and so far so good." [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [14:43:56] * volans 2 over 2 CR merged while reviewing... 
need to restart internal NTPd [14:44:15] :) [14:47:56] volans: the threshold is set in setUp() [14:48:33] ema: ah right, missed that, thx [14:48:55] I assumed it was 50% anyway ;) [14:49:41] volans: if I understand your other question correctly, no, we don't try to depool servers which are fine according to pybal's monitoring [14:50:30] ema: what I was trying to ask is, is canDepool() independent of the server you want to depool? [14:50:44] volans: yes [14:50:58] shouldn't it instead be relative to it? [14:51:13] depending on the situation I might be able to depool *that* server, but not another one [14:51:37] canDepool is only called upon monitoring changes, not in case of admin actions [14:51:50] ok [14:52:50] so it doesn't cover the case where an admin shoots herself in the foot by depooling the only host serving traffic [14:53:04] btw, not sure I agree with sum() being cleaner [14:53:11] sum to me suggests summing integers, not counting [14:53:15] !log resetting RAC on labsdb1006 (serial console inaccessible) [14:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:56] mark: it's personal, no strong feeling, I just thought it slightly unnecessary to create another list just to count its size [14:54:04] and not use it for anything else [14:54:32] the sum already returns the result needed in this specific case [14:54:40] and is slightly shorter :) [14:54:59] i guess [14:55:04] * ema agrees with volans but blissfully ignores his comment [14:55:09] there's a whole lot of that in pybal btw [14:55:18] especially since a lot of it predates these features in python hehe [14:55:44] PROBLEM - DPKG on es2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:55:53] i wanted to mention something similar with a set comprehension earlier in ema's code [14:55:58] and then noticed pybal used a dict [14:56:05] iirc, because pybal didn't have sets yet I abused a dict ;) [14:56:08] python [14:56:23] (03PS1) 10Ema: Use up-and-enabled
servers in can-depool logic [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404680 (https://phabricator.wikimedia.org/T184715) [14:56:36] sets were added in 2.3 [14:56:54] yup [14:57:04] i think pybal was written for 2.2 or thereabouts [14:57:24] ehehe [14:57:25] (03CR) 10Ema: [C: 032] Use up-and-enabled servers in can-depool logic [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404680 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [14:57:52] !log resetting RAC on labsdb1007 (serial console inaccessible) [14:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:19] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3906167 (10fgiunchedi) So I tried to investigate this on restbase1013 which reliably failed to reboot cleanly after d-i finished: ``` Loading Linux 4.9.0-0.bpo.5-amd64 ... Loading initial ramdisk ...... [15:04:38] (03PS7) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [15:05:07] (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [15:06:07] (03PS8) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [15:06:33] (03CR) 10Elukey: [C: 032] Allow to explicitly set the JAVA_HOME environment variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/403701 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [15:07:00] (03PS9) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [15:08:33] (03PS1) 10Elukey: Update 
the cdh module to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/404685 (https://phabricator.wikimedia.org/T166248) [15:16:30] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9761/" [puppet] - 10https://gerrit.wikimedia.org/r/404685 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [15:21:43] (03CR) 10Eevans: [C: 031] "I always struggle getting the first bootstrap cleanly started when Puppet cannot run to completion (which it cannot while units are masked" [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [15:23:39] (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) [15:25:38] (03CR) 10Filippo Giunchedi: [C: 032] restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [15:25:54] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 [15:27:18] (03CR) 10Eevans: [C: 031] restbase: reprovision restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [15:28:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 10Marostegui) [15:29:29] (03PS1) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) [15:30:13] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 10Marostegui) [15:30:23] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 
10Marostegui) [15:32:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1104 (duration: 01m 12s) [15:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 [15:34:55] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 [15:36:17] !log upgrading nginx on mw servers in codfw to 1.13.6-2+wmf1~jessie1 [15:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:34] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui) [15:38:55] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui) [15:39:49] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui) [15:41:10] <_joe_> !log dropping ruwiki htmlCacheUpdate records stuck in the old jobqueue [15:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 after fixing data drifts - T162807 (duration: 01m 12s) [15:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:36] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:45:59] (03PS1) 10Arlolra: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 [15:48:02] (03PS1) 10Marostegui: db1063.yaml: Disable notifications
[puppet] - 10https://gerrit.wikimedia.org/r/404692 (https://phabricator.wikimedia.org/T184397) [15:48:42] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3906289 (10Andrew) [15:48:45] 10Operations, 10Cloud-VPS, 10monitoring, 10cloud-services-team (Kanban): remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3906287 (10Andrew) 05Open>03Resolved yep, the project is gone. [15:49:12] (03CR) 10Marostegui: [C: 032] db1063.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404692 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui) [15:50:19] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3906294 (10chasemp) [15:51:41] !log labstore1002:~# /sbin/reboot [15:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:22] (03PS1) 10Ema: 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/404694 (https://phabricator.wikimedia.org/T184715) [15:53:13] (03CR) 10Ema: [C: 032] 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/404694 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [15:53:21] (03PS1) 10Ema: 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404695 (https://phabricator.wikimedia.org/T184715) [15:54:15] (03CR) 10Ema: [C: 032] 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404695 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema) [15:54:59] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906316 (10chasemp) [15:55:33] 10Operations, 10ops-eqiad, 
10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3897875 (10chasemp) thanks @Marostegui [15:55:37] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3906319 (10Bawolff) Thankyou [15:57:31] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906331 (10chasemp) [15:58:13] godog: is there any reason that the mcrouter package doesn't add an init.d entry or was that just not finished? [15:59:26] (03PS1) 10Ottomata: Parameterize varnishkafka certificate name for easier setup in Cloud VPS. [puppet] - 10https://gerrit.wikimedia.org/r/404698 (https://phabricator.wikimedia.org/T121561) [15:59:44] AaronSchulz: no idea specifically, but I'm not surprised since we shouldn't be shipping init.d scripts but systemd service files instead [16:00:33] !log pybal 1.14.3 uploaded to apt.w.o [16:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:15] godog: it doesn't appear to do that either [16:04:18] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906361 (10chasemp) [16:04:21] 10Operations, 10Cloud-VPS, 10cloud-services-team: Reboot non-labvirt cloud provider hardware for meltdown - https://phabricator.wikimedia.org/T184730#3906362 (10chasemp) [16:04:26] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3906358 (10chasemp) 05Open>03Resolved a:03chasemp Full working etherpad is archived at https://wikitech.wikimedia.... 
[16:05:33] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9763/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/404698 (https://phabricator.wikimedia.org/T121561) (owner: 10Ottomata) [16:06:09] !log upgrading nginx on mwdebug servers to 1.13.6-2+wmf1~jessie1 [16:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:03] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906365 (10chasemp) [16:07:08] !log upgrading HHVM in codfw to 3.18.7 (wmf4) [16:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:19] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) [16:09:02] AaronSchulz: ok, can you point me to the repo? [16:09:30] (03CR) 10Subramanya Sastry: [C: 031] "does this need a rebase?" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:09:47] godog: I assume it is https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/mcrouter [16:09:53] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906372 (10chasemp) [16:09:54] (03PS1) 10Ottomata: Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701 [16:09:55] !log routing ns0 to codfw (baham) [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:43] (03PS2) 10Ottomata: Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701 [16:10:52] (03CR) 10Ottomata: [V: 032 C: 032] Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701 (owner: 10Ottomata) [16:12:55] !log labmon1001:~# /sbin/reboot [16:13:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:11] AaronSchulz: thanks, I asked because I didn't work on it, _joe_ did [16:14:36] (03CR) 10Arlolra: "no" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:15:47] AaronSchulz: there's a debian/mcrouter.service in that repo, though? [16:16:26] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906379 (10chasemp) [16:16:42] AaronSchulz: the debian/rules file uses dh_systemd_enable --no-enable, so you need to manually enable it after installation [16:17:06] RECOVERY - Check systemd state on labmon1001 is OK: OK - running: The system is fully operational [16:17:34] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3906383 (10fgiunchedi) Booting both restbase1013 and restbase1015 without `quiet` it looks like a race condition: on 1015 assembly worked as intended but on 1013 it failed in the usual way we've experience... [16:17:39] !log reboot radon (eqiad authdns) for kernel upgrade [16:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:50] !log labmon1001:~# service grafana-server [16:18:01] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) [16:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:57] moritzm: thanks for taking a look! [16:19:29] (03PS1) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704 [16:21:02] godog: I didn't see it in systemctl status, shouldn't it be there? [16:21:09] XioNoX: radon is back online and serving queries, please revert the routing change! 
[16:21:37] cool [16:21:50] (03PS2) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704 [16:22:00] ema: reverted [16:22:20] XioNoX: looks good [16:22:20] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906403 (10chasemp) [16:22:33] godog: nvm [16:22:38] I was missing -a ;) [16:22:48] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) a:03Andrew [16:23:03] (03PS1) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) [16:23:11] (03CR) 10Gehel: "This patch only configures beta (production changes have been extracted to another patch). We are good to go for this one." [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [16:23:27] (03CR) 10Gehel: [C: 031] Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [16:23:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [16:23:50] <_joe_> urandom: do you want me to take a peek? [16:24:04] ema: ready for ns1 [16:24:08] (03PS3) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704 [16:24:30] XioNoX: ns1 to radon for baham reboot, sounds good! 
[16:24:43] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906413 (10chasemp) [16:24:59] !log routing ns1 to eqiad [16:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:41] (03PS2) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) [16:25:46] _joe_: sure [16:25:49] ema: done [16:26:01] _joe_: if you promise not to think less of me as a person [16:26:11] _joe_: godog made me do it! [16:26:14] <_joe_> urandom: ahahah :( [16:26:15] (03CR) 10MarkTraceur: [C: 031] "Looks good, ready to deploy from my POV" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie) [16:26:16] !log reboot baham (codfw authdns) for kernel upgrade [16:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:31] * urandom throws godog under the bus [16:26:32] (03PS10) 10Gehel: Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [16:27:00] <_joe_> EWWWWW [16:27:04] ya [16:27:06] <_joe_> :P [16:27:15] (03CR) 10Smalyshev: [C: 031] wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel) [16:27:42] hahaha yeah mkdir -p equivalent still isn't a thing in puppet is it? 
[16:28:01] (03CR) 10Gehel: [C: 032] Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [16:28:03] (03PS1) 10Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561) [16:29:19] (03PS2) 10Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561) [16:29:31] (03CR) 10Ottomata: [V: 032 C: 032] Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561) (owner: 10Ottomata) [16:30:41] XioNoX: baham back online and serving queries [16:30:54] ema: rolling back routing changes [16:31:02] yes please [16:31:17] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906430 (10greg) Asking for help from @aaron (kind of object cache) and #operations / #performance-team to diagnose this. Ori was previously the be... [16:31:29] done [16:31:40] last one is ns2 -> radon for eeden reboot [16:31:49] yup [16:33:50] !log routing ns2 to radon [16:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:46] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906441 (10chasemp) [16:36:12] XioNoX: looks like it's done, ok to reboot? 
[16:36:27] (03PS1) 10Mark Bergsma: Separate out coordinator.Server into its own module [debs/pybal] - 10https://gerrit.wikimedia.org/r/404713 [16:36:29] ema: nop, something is wrong with routing it seems [16:36:45] XioNoX: ok, I don't see ns2 queries coming into eeden [16:36:49] so that part works [16:37:10] I also don't see them on radon, so that part does not work [16:37:24] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:57] I gathered some data and rolled back [16:38:06] XioNoX: ok I see ns2 queries back to eeden [16:38:43] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.94 ms [16:39:13] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 21 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [16:39:20] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906454 (10Tgr) [16:39:30] so for some reason the router doesn't want to accept the route to 208.80.154.93 [16:40:06] some kind of filter perhaps? [16:40:15] no arp? [16:41:23] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm] [16:42:37] maybe because of the no-resolve keyword [16:45:32] (03CR) 10Gehel: [C: 031] "The related patch for beta has been merged (and does not break anything - there isn't a crh-wiki on beta to test more).
We can probably me" [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: 10Gehel) [16:47:23] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [16:48:04] I'll investigate it, probably by adding a new VIP from the esams range on radon, and adding matching statics, that way no risk for production traffic [16:49:12] XioNoX: sounds good to me, let's postpone eeden reboot then [16:49:13] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:49:16] (03CR) 10Dzahn: "gotcha Jaime, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: 10Dzahn) [16:49:40] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906497 (10greg) p:05Triage>03High This is almost UBN! per "This is causing siteinfo API requests (probably all API requests) to fail, which is... 
[16:49:41] moritzm: see above, radon and baham rebooted, eeden TODO [16:49:51] (03PS2) 10Dzahn: testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:50:10] (03PS3) 10Dzahn: testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:50:45] (03CR) 10Dzahn: "it does, but right before merge in any case, so doing more before that not needed" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:51:08] (03CR) 10Dzahn: [C: 032] testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:51:23] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:52:37] (03CR) 10Dzahn: "deployed on ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra) [16:52:43] !log upgrade secondary LVSs to pybal 1.13.4 T184715, T184721 [16:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:55] T184721: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721 [16:52:56] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [16:53:24] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906510 (10greg) [16:53:42] @seen Ladsgroup [16:53:43] mutante: I have never seen Ladsgroup [16:53:51] mutante it's Amir1 :) [16:54:00] :) thanks! [16:54:07] you're welcome :).
[16:54:27] (03PS3) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) [16:54:32] Amir1: you were mentioned for this because you did work on standardizing error pages :) https://gerrit.wikimedia.org/r/#/c/395552/ [16:57:26] mutante: hey, I just got back [16:57:29] let me check [16:58:06] mutante: Can I run a quick test and let you know? [16:58:26] (03PS1) 10Jcrespo: mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721 [16:59:28] (03PS3) 10Gehel: wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 [16:59:53] ema: great [17:00:15] Amir1: of course, you can run tests for weeks ;) [17:00:28] welcome back [17:00:34] (03CR) 10Eevans: "[PC output](http://puppet-compiler.wmflabs.org/9765/)" [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [17:00:59] (03CR) 10Gehel: [C: 032] wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel) [17:03:04] (03PS2) 10Jcrespo: mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721 [17:04:13] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [17:04:30] mutante: tests have finished and it looks good IMO [17:04:55] mutante: tell me when you merged it so I approve the GCI task [17:05:04] PROBLEM - Check the NTP synchronisation status of timesyncd on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:05:13] PROBLEM - MegaRAID on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:06:10] !log upgrade pybal on primary LVSs to 1.14.3 T184715, T184721 [17:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:22] T184721: Alert instrumentation returning 500 errors - 
https://phabricator.wikimedia.org/T184721 [17:06:23] T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 [17:07:03] PROBLEM - SSH on labstore2003 is CRITICAL: connect to address 10.192.21.6 and port 22: Connection refused [17:07:03] PROBLEM - Disk space on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:07:13] PROBLEM - Check systemd state on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:07:14] PROBLEM - DPKG on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:07:14] PROBLEM - configured eth on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:07:14] PROBLEM - dhclient process on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:08:22] !log bootstrap cassandra-a on restbase1013 [17:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:34] PROBLEM - puppet last run on labstore2003 is CRITICAL: Return code of 255 is out of bounds [17:08:34] (03CR) 10Jcrespo: [C: 032] mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721 (owner: 10Jcrespo) [17:11:40] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906569 (10greg) [17:12:12] !log Rebooting labstore2004 [17:12:20] (03PS4) 10Eevans: cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) [17:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:50] (03CR) 10Chad: [C: 031] "Fine by me, anything is an improvement over the current page :D" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:16:30] (03PS5) 10Thcipriani: Scap canary: cache last good deploy time [puppet] - 
10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) [17:17:07] !log reboot labstore2003 [17:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:39] (03CR) 10Chad: [C: 031] "Actually, minor nit re: file naming inside (not a blocker, but would be nice)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:19:49] (03CR) 10Thcipriani: "Fixups from Volans review." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [17:20:12] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3906596 (10ema) 05Open>03Resolved a:03ema [17:20:21] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721#3906598 (10ema) 05Open>03Resolved a:03ema [17:24:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:24:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:25:16] thcipriani: there should be only one line starting with MEDIAWIKI_STAGING_DIR right? 
[17:26:08] <_joe_> whoa big big spike in 5xx [17:26:26] reported in -tech too [17:26:33] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [17:26:35] <_joe_> still ongoing? [17:26:36] <_joe_> yes [17:27:13] <_joe_> no it seems over [17:28:13] RECOVERY - SSH on labstore2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [17:28:14] RECOVERY - Disk space on labstore2003 is OK: DISK OK [17:28:23] RECOVERY - Check systemd state on labstore2003 is OK: OK - running: The system is fully operational [17:28:23] RECOVERY - configured eth on labstore2003 is OK: OK - interfaces up [17:28:23] RECOVERY - DPKG on labstore2003 is OK: All packages OK [17:28:24] RECOVERY - dhclient process on labstore2003 is OK: PROCS OK: 0 processes with command name dhclient [17:31:03] (03CR) 10Kaldari: [C: 04-1] Add a test verifying that rtl.dblist is up to date (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem) [17:31:37] volans: that's true, updating patch. [17:31:53] (03CR) 10Filippo Giunchedi: add support for SSLCARevocationCheck setting in puppetmaster frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [17:31:59] thcipriani: if you wait 2 min I'll finish the review ;) [17:32:13] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [17:32:37] volans: heh, k, thanks for the review by the by :) [17:32:51] moritzm, just saw your email re. labsdb1004 [17:32:56] Sorry for the terrible delay. [17:33:26] I'm available for the next 6 hours if today works.
[17:33:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:33:49] Otherwise, I'll respond to the email with some ideas for earlier UTC tomorrow. [17:34:33] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [17:34:44] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:35:04] RECOVERY - Check the NTP synchronisation status of timesyncd on labstore2003 is OK: OK: synced at Wed 2018-01-17 17:35:02 UTC. [17:35:13] RECOVERY - MegaRAID on labstore2003 is OK: OK: optimal, 1 logical, 2 physical [17:35:29] (03PS26) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) [17:35:42] (03CR) 10Paladox: Update gerrit login display (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:36:50] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 [17:37:28] (03CR) 10Volans: "Thanks for the fixes, much nicer! I've added just a couple of smaller comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [17:38:35] thcipriani: done ^^ :) [17:39:45] thanks! 
[17:40:00] yw [17:40:11] volans: ^ [17:40:25] (03CR) 10Arturo Borrero Gonzalez: "Thanks Volans for the review." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:41:16] arturo: ack, looking [17:41:27] * volans hates when gerrit doesn't scroll properly [17:42:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3906721 (10RobH) 05Open>03Resolved they are in racktables and now being tracked int eh spares rack in eqiad. [17:42:29] (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 [17:43:45] (03PS1) 10Ottomata: [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [17:44:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [17:44:12] !log resetting RAC on labsdb1004 (serial console inaccessible) [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:24] RECOVERY - Disk space on stat1005 is OK: DISK OK [17:45:11] (03PS2) 10Ottomata: [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [17:45:55] (03CR) 10Chad: [C: 031] "lgtm, let's do this!" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:46:04] no_justification thanks :). [17:46:27] Is gerritLogin.js something hardcoded in Gerrit itself? ie: would gerritLogin.cache.js not work? 
[17:46:39] A minor detail since it only loads on the login page, mostly curious [17:47:16] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9766/cp1054.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [17:47:24] no_justification we can use gerritLogin.cache.js if you want [17:47:29] we load the js file in the css file [17:47:37] (03PS27) 10Dzahn: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:47:45] Where's that JS file loaded from? [17:48:04] (03PS28) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) [17:48:11] no_justification here https://gerrit.wikimedia.org/r/#/c/402665/28/modules/gerrit/files/etc/GerritSite.css [17:48:12] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 [17:48:27] [17:49:05] (03PS29) 10Dzahn: gerrit: new fancy login page design [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:49:22] Hah! [17:49:29] Really, that's injected as raw HTML? [17:49:37] no_justification apparently so [17:49:47] #til [17:49:57] Oh well, it's not urgent. Let's land it as-is, then maybe follow up [17:50:24] (03CR) 10Dzahn: [C: 032] "there you go" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [17:51:54] no_justification mutante looks perfect https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen :). 
[17:52:03] https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen [17:52:28] thanks, it definitely looks more modern [17:52:40] and the part that upstream designer guy was on the change was also nice [17:52:43] for licensing [17:52:44] :) [17:52:48] There's a bunch of negative space at the top for me, but minor [17:52:50] arturo: if I'm not mistaken the puppet side of it is missing the dependency on the python3-apt lib [17:53:10] no_justification i guess it must be that margin-top: 10%; [17:53:42] https://phabricator.wikimedia.org/F12619936 [17:54:05] 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Krinkle) Is this task about session storage Redis or JobQueue Redis? I would assume JobQueue Redis given that ApiSiteInfo is supports ou... [17:54:16] volans: I sent this --> https://gerrit.wikimedia.org/r/404740 [17:54:28] ahhh separate one [17:54:31] missed that [17:54:45] (03Draft1) 10Paladox: Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741 [17:54:47] (03Draft2) 10Paladox: Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741 [17:54:55] paladox: Yeah, removing that rule made it work nicer. Does that affect anything else? [17:55:15] I don't think so, it's targeted to the login page [17:55:18] no_justification from my quick testing nope. [17:55:33] arturo: that's telling puppet the order, but not actually installing the package. You need also to explictely add the package [17:55:49] !log gerrit login page design changed (https://gerrit.wikimedia.org/r/402665) in case you were worried it was a fake page trying to steal your login, heh [17:55:52] no_justification it works :). 
https://gerrit.wikimedia.org/r/404741 [17:55:57] the order and the dependency, it will fail if there is no Package['python3-apt'] resource managed, to be precise [17:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:10] tested on https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen [17:56:26] oh volans I think I understand now [17:56:56] we usually use require_pacakge around the code, that is wrapper that accepts both multiple params, one per package [17:57:00] or a list of packages [17:57:10] (03PS1) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) [17:57:13] *require_package ofc [17:57:30] (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 [17:57:32] (03PS2) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) [17:57:48] volans: not that solution then ^^ [17:58:22] (03CR) 10Muehlenhoff: Ensure samtar and samwalton9 are absent after account expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [17:58:39] (03CR) 10Dzahn: [C: 032] Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741 (owner: 10Paladox) [17:58:44] thanks :) [17:59:16] arturo: that works too, the require_package allow you to do a single call for all the packages required in that file and also ensure that they are installed before anything in the same scope is executed [17:59:26] * James_F waves in advance of jouncebot. 
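[editor's note] The point volans makes above (a `require` only sets ordering and a dependency, and fails if no Package['python3-apt'] resource is managed) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the content of change 404740: the file path and `source` URL are hypothetical, and only the package name and the `require_package` wrapper come from the discussion.

```puppet
# Hypothetical script resource, for illustration only.
# The require below expresses ordering and a dependency; by itself it
# does not install anything, and the catalog fails to compile if no
# Package['python3-apt'] resource is declared anywhere.
file { '/usr/local/sbin/apt-upgrade':          # hypothetical path
    ensure  => file,
    source  => 'puppet:///modules/apt/apt-upgrade.py',  # hypothetical source
    require => Package['python3-apt'],
}

# Option 1: declare the package explicitly so the reference resolves.
package { 'python3-apt':
    ensure => present,
}

# Option 2 (instead of option 1): the wmflib wrapper mentioned in the
# discussion, which takes one package per argument or a list of packages.
# require_package('python3-apt')
```

Either option should give the `require` something to point at; per the discussion, `require_package` additionally ensures the packages are installed before anything else in the same scope is executed.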
[17:59:33] see modules/wmflib/lib/puppet/parser/functions/require_package.rb for its implementation if you're curious ;) [17:59:49] https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen looks much better now :). [17:59:51] (03PS3) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) [17:59:59] 10Operations, 10MediaWiki-API, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906780 (10Krinkle) [18:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1800). [18:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] (03CR) 10Muehlenhoff: [C: 031] Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [18:00:16] paladox: confirmed :) [18:00:19] :) [18:00:20] 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Krinkle) [18:00:28] (03CR) 10Volans: "ok, thanks for the reply and the follow up on I4ead8b545e57cd135cee313636c816da194cacfd" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:00:28] o/ James_F I can SWAT. [18:00:50] Ta. [18:01:17] (03CR) 10Volans: [C: 031] "LGTM, optional nitpick inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez) [18:02:06] bblack: shall we merge the Letsencrypt license change now? [18:03:01] as a side note, mod_md has been merged to 2.4.x and it will be part of 2.4.30 [18:03:11] https://httpd.apache.org/docs/2.4/mod/mod_md.html [18:04:13] (03PS4) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) [18:04:16] (03CR) 10Ottomata: [V: 032 C: 032] Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata) [18:06:18] (03CR) 10Dzahn: [C: 031] letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:07:22] Zuul seems sleep deprived or something. [18:07:33] (03CR) 10Volans: [C: 031] "LGTM, optional change inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez) [18:07:37] 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906804 (10aaron) So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (try... 
[18:08:01] (03PS1) 10Ottomata: Use log_retention params in profile::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/404747 [18:09:08] !log uploading HHVM 3.18.5+wmf4 for stretch-wikimedia to apt.wikimedia.org (3.18.7 with the patch https://github.com/facebook/hhvm/commit/bd7b2bcfe70b053a3a001480653012f68599250f backed out) [18:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:57] elukey: oh, nice [18:12:41] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906818 (10madhuvishy) [18:12:44] 10Operations, 10cloud-services-team: labstore2003 reboots into mode missing /srv disks - https://phabricator.wikimedia.org/T185102#3906816 (10madhuvishy) 05Open>03Resolved Drives not being mounted at /srv is the right behavior. The lvms aren't mounted by default because if they were, our bdsync based backu... [18:16:43] (03PS2) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) [18:17:43] Alrighty! James_F - You're good to test it on mwdebug1002. [18:18:04] Niharika: Yup, LGTM. [18:18:20] Let's sync it. 
[18:18:27] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9767/" [puppet] - 10https://gerrit.wikimedia.org/r/404747 (owner: 10Ottomata) [18:19:49] (03CR) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron) [18:20:44] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3906844 (10jcrespo) [18:20:47] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3906843 (10jcrespo) [18:20:49] !log niharika29@tin Synchronized php-1.31.0-wmf.17/includes/EditPage.php: Update Save/Publish button flag from 'constructive' to 'progressive' https://gerrit.wikimedia.org/r/#/c/404733/ (duration: 01m 14s) [18:20:51] James_F: Done. ^ [18:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:07] Niharika: Thank you so much. :-) [18:21:20] You're welcome. [18:21:32] (03PS5) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 [18:21:59] (03CR) 10jerkins-bot: [V: 04-1] ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [18:22:31] (03CR) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [18:23:46] (03PS6) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 [18:26:33] (03CR) 10Qgil: [C: 04-1] "This patch would enable one specific feed. We need to discuss whether it is possible to whitelist any RSS feed coming from a domain. 
Would" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: 10Aklapper) [18:27:45] mutante: yes please [18:28:11] (03PS1) 10Rush: icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) [18:29:14] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906886 (10chasemp) [18:31:03] mutante: brief sanity check review https://gerrit.wikimedia.org/r/#/c/404751/ please? [18:33:21] (03PS1) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:33:39] (03PS2) 10Dzahn: deployment-prep: Commit hiera config for etcd [puppet] - 10https://gerrit.wikimedia.org/r/403205 (owner: 10Chad) [18:33:50] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:34:25] (03CR) 10Dzahn: [C: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/403205 (owner: 10Chad) [18:34:41] (03PS2) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:35:08] (03PS3) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:35:10] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:35:32] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:37:11] (03CR) 10Dzahn: [C: 031] icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) (owner: 10Rush) 
[18:37:18] chasemp: looks good! [18:37:21] bblack: ok :) [18:37:29] no_justification: done [18:37:35] ty! [18:37:46] (03PS2) 10Rush: icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) [18:37:48] tx mutante [18:37:53] yw [18:38:21] (03PS3) 10Dzahn: letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:38:25] (03CR) 10Rush: [C: 032] icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) (owner: 10Rush) [18:41:37] (03PS4) 10Dzahn: letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:43:00] (03CR) 10Dzahn: [C: 032] letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:43:31] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:43:31] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:43:42] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:43:53] (03PS4) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:44:02] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:44:02] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:44:02] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:44:22] PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:44:26] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:44:54] (03CR) 10Dzahn: "no issues with puppet runs: cobalt, netmon1002, dbmonitor1001 ..." [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:45:46] (03PS5) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:46:10] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:46:31] PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer [18:47:12] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:48:59] Krenair: i merged the LE change, no problems in prod. i then went to deployment-mx02 to confirm, i guess need to wait for puppetmaster to sync [18:50:27] (03PS6) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:50:56] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:51:15] (03PS7) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:51:21] (03CR) 10Dzahn: "so far the error has not changed on deployment-mx02 but i think it just needs to sync the puppetmaster with prod.. should work in a bit.." 
[puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk) [18:51:47] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:52:42] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational [18:53:02] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK [18:53:02] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK [18:53:02] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full [18:53:31] RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient [18:53:31] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set [18:53:31] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up [18:53:32] RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [18:53:53] (03PS8) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:54:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [18:55:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez) [18:55:49] (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 [18:56:50] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez) [18:57:11] RECOVERY - puppet last run on 
pybal-test2001 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures [18:57:56] (03PS9) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [18:59:06] (03PS10) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:08] (03PS11) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [19:00:29] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez) [19:00:31] (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 [19:00:39] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez) [19:01:12] (03PS12) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 [19:02:17] (03CR) 10Jcrespo: [C: 032] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo) [19:02:22] beta seems down, is this known? https://en.wikipedia.beta.wmflabs.org/ gives a 503 [19:02:31] MatmaRex hi i think so [19:02:38] MatmaRex: I think greg sent an email recently [19:02:42] MatmaRex https://phabricator.wikimedia.org/T185055 [19:02:49] okay. thanks [19:02:57] but I cannot say if related, but worth checking [19:03:28] hmm, that task seems narrower. 
it only talks about the API, but the whole site is down [19:04:43] MatmaRex: yeah, seems to be a different issue [19:06:16] tgr according to logstash-beta this is most likly redis. Unless redis logback is hiding the true problem. [19:06:39] ...or not, the error just looks different because this has Varnish in the middle and the API doesn't [19:06:47] 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Paladox) On logstash-beta i see wiki:enwiki exception.trace:#0 [internal function]: MWExceptionHandler::handleErr... [19:08:18] 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3907077 (10Tgr) All of Beta MediaWiki seems to be down now. ``` tgr@deployment-mediawiki04:~$ curl -v -H 'Host: en.wikipedia.... [19:08:32] (03PS1) 10Jcrespo: mariadb-temporary_storage: Do not monitor mariadb [puppet] - 10https://gerrit.wikimedia.org/r/404761 [19:09:07] (03CR) 10Jcrespo: [C: 032] mariadb-temporary_storage: Do not monitor mariadb [puppet] - 10https://gerrit.wikimedia.org/r/404761 (owner: 10Jcrespo) [19:12:53] 10Operations, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3907085 (10Anomie) This has nothing to do with #mediawiki-api, or as far as I can tell anything to do with code in MediaWiki at all. It seems to... 
[19:13:45] (03PS1) 10Mark Bergsma: Expand test coverage of server.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/404762 [19:16:12] (03Restored) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: 10Zoranzoki21) [19:16:18] (03PS12) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) [19:17:18] (03PS1) 10Jcrespo: mariadb: Enable notifications on es2001-4 and set default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/404768 [19:20:50] (03CR) 10Ladsgroup: [C: 031] Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [19:21:37] (03PS6) 10Zoranzoki21: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [19:22:31] (03PS2) 10MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) [19:22:35] (03CR) 10MaxSem: Add a test verifying that rtl.dblist is up to date (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem) [19:23:52] (03PS3) 10MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) [19:24:01] (03PS6) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [19:24:34] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 
(https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [19:24:43] (03CR) 10Jcrespo: [C: 04-1] "Not until they have been reimaged." [puppet] - 10https://gerrit.wikimedia.org/r/404768 (owner: 10Jcrespo) [19:28:30] (03PS7) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [19:39:00] (03PS1) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [19:39:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [19:39:31] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3907130 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete. [19:41:27] (03PS2) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [19:41:39] (03CR) 10Hashar: "[5/5] I will logout/login more often!" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [19:41:54] hashar thanks :). 
[19:42:30] (03PS7) 10Dzahn: mediawiki: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [19:43:58] (03CR) 10Dzahn: [C: 032] mediawiki: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [19:44:20] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9769/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [19:44:31] (03CR) 10Hashar: "Can you possibly reuse the commit message from the original change https://gerrit.wikimedia.org/r/#/c/398484/ ? And maybe explain why it " [puppet] - 10https://gerrit.wikimedia.org/r/404480 (owner: 10Herron) [19:44:44] (03PS3) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [19:45:17] !log Powering down mw2140 for main board replacement [19:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:38] (03CR) 10Dzahn: "thanks paladox and markusguenther for making the design and releasing it !:)" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [19:46:31] !log rebooting labpuppetmaster1002 [19:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:50] (03Draft1) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778) [19:46:52] (03Draft2) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778) [19:48:34] (03CR) 10Dzahn: "is the compiler error my fault or just a limitation because i rename the hiera key 
in the same change? http://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [19:49:25] (03CR) 10Dzahn: "commit message has been updated since that last reviewer comment" [puppet] - 10https://gerrit.wikimedia.org/r/382930 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:50:10] (03CR) 10Dzahn: "it compiles http://puppet-compiler.wmflabs.org/9711/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/403730 (owner: 10Dzahn) [19:52:14] PROBLEM - Host mw2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:53:11] (03PS8) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [19:53:26] (03PS9) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [19:53:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [19:54:57] (03PS10) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [19:56:01] Krinkle https://gerrit.wikimedia.org/r/404777 [19:56:03] !log rebooting labpuppetmaster1001 [19:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:41] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3907183 (10Andrew) [19:57:25] (03PS3) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726 [19:59:23] (03CR) 10Dzahn: [C: 04-1] "i dont understand why this gets unrelated "Error: Could not find resource 
'Exec[apt-get update]' for relationship from 'Class[Profile::C" [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn) [20:00:04] thcipriani: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:01:52] * thcipriani works on it [20:03:05] PROBLEM - Host labcontrol1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:01] (03CR) 10Dzahn: "i'll split it into multiple patches, i assume you prefer that, right" [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn) [20:04:05] (03Abandoned) 10Dzahn: wmcs/labs: move more firewall/standard includes into roles [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn) [20:05:01] !log rebooted labservices1002, labcontrol1002, labnet1002 [20:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:45] PROBLEM - Host labservices1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:24] RECOVERY - Host labservices1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [20:08:43] (03CR) 10Ottomata: [C: 031] "Hm, don't know either, but +1 ! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn) [20:09:38] (03PS8) 10Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: 10Gehel) [20:09:56] (03PS8) 10Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) [20:14:07] (03PS1) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) [20:14:32] (03CR) 10jerkins-bot: [V: 04-1] No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [20:16:04] RECOVERY - Host labcontrol1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:16:36] (03PS1) 10Dzahn: labtest: move firewall/standard includes to roles [puppet] - 10https://gerrit.wikimedia.org/r/404790 [20:16:43] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3907234 (10Andrew) [20:17:11] (03PS2) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) [20:17:45] (03CR) 10jerkins-bot: [V: 04-1] No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [20:18:33] (03PS3) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) [20:21:06] (03CR) 10Ottomata: [V: 032 C: 032] 
"https://puppet-compiler.wmflabs.org/compiler02/9772/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata) [20:21:10] (03PS1) 10Dzahn: site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791 [20:22:41] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/404790/" [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn) [20:23:04] (03PS4) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726 [20:25:20] (03PS5) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726 [20:25:34] PROBLEM - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:56] (03CR) 10Dzahn: [C: 032] druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn) [20:30:10] !log pnorman@tin Started deploy [kartotherian/deploy@ecdda41]: (no justification provided) [20:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:33] (03CR) 10Kaldari: [C: 032] Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem) [20:34:55] RECOVERY - Host mw2140.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 37.70 ms [20:35:54] !log pnorman@tin Finished deploy [kartotherian/deploy@ecdda41]: (no justification provided) (duration: 05m 44s) [20:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:06] (03Merged) 10jenkins-bot: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem) [20:37:02] (03CR) 10jenkins-bot: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem) [20:37:44] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2861029 (10Isarra) So when's this happening? Wheeeeeen? [20:40:15] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/includes/Storage/RevisionStore.php: [[gerrit:404757|[MCR] RevisionStore::getTitle final logged fallback to master]] PART I (duration: 01m 04s) [20:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:36] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907310 (10dr0ptp4kt) Hi @Isarra , just wanted to note that @Deskana is taking on product owner duties on this and is working with @Tgr a... [20:41:35] PROBLEM - Host mw2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:41:54] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/includes/ServiceWiring.php: [[gerrit:404757|[MCR] RevisionStore::getTitle final logged fallback to master]] PART II (duration: 01m 12s) [20:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:26] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/vendor/wikibase/data-model-services: [[gerrit:404758|Add missing files from wikibase/data-model-services 3.9.0]] (duration: 01m 15s) [20:45:26] (03CR) 10Rush: [C: 031] site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791 (owner: 10Dzahn) [20:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:44] RECOVERY - Host mw2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [20:48:46] (03CR) 10Phantom42: "Thank you for merging this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [20:49:59] (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 [20:52:18] (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani) [20:52:42] (03PS2) 10Dzahn: site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791 [20:53:04] (03CR) 10Dzahn: [C: 032] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/404791 (owner: 10Dzahn) [20:53:22] (03CR) 10Dzahn: "thanks for the fix :)" [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [20:55:08] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani) [20:56:44] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3907372 (10Ottomata) [20:56:54] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3840663 (10Ottomata) a:03Ottomata [20:57:11] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani) [20:57:38] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.17 [20:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:44] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907388 (10Raymond) >>! In T133410#3907310, @dr0ptp4kt wrote: > Hi @Isarra , just wanted to note that @Deskana is taking on product owner... 
[21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:19] Nothing for ORES today [21:03:25] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907402 (10Isarra) If we want to propose specific projects for this, should we just do the usual discussion on-wiki to see if there's con... [21:03:41] !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.17 (duration: 01m 11s) [21:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:15] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3907403 (10Papaul) a:05Papaul>03MoritzAccountTest Main board replacement complete - Test ssh connection ( racadm power commands) - clear log - Update IDRAC firmware from version 2.21 to version... 
[21:11:49] (03PS2) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:12:20] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [21:12:59] (03PS3) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:13:26] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [21:15:04] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [21:16:04] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time [21:17:17] (03PS4) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:18:09] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [21:18:09] (03PS5) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:18:25] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [21:19:00] (03PS6) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:20:01] (03PS7) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [21:27:44] (03PS1) 10Jdlrobson: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442) [21:37:32] (03CR) 10Framawiki: "This patch was created to create a redirect from techblog.wikimedia.org to blog.wikimedia.org/c/technology, instead of the main page of th" [puppet] - 
10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki) [21:44:15] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members] [21:44:34] PROBLEM - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [21:45:24] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [21:50:34] RECOVERY - Keyholder SSH agent on labpuppetmaster1001 is OK: OK: Keyholder is armed with all configured keys. [21:54:53] herron: about? quick q [21:55:09] chasemp: hey, what’s up? [21:55:22] hey did we build our own trusty puppet packages in the end? [21:55:24] for 4.x I mean [21:55:30] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3907591 (10RobH) a:05MoritzAccountTest>03MoritzMuehlenhoff [21:57:05] chasemp: yeah, ended up backporting the debian packages to trusty. 
there is some background about it in T182894 [21:57:06] T182894: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894 [21:58:48] herron: so I may be one of the few that will use util/localrun to test on an instance but it seems to have broken and it looks like 'Error while evaluating a Function Call, uninitialized constant RGen::ECore::ELong' on Trusty, and I think it may be that ruby-rgen is not only no longer a dependency but conflicts for puppet 4.x or so https://bugzilla.redhat.com/show_bug.cgi?id=1411809 [22:00:01] but maybe that's not right if https://packages.debian.org/stretch/puppet is to be believed [22:00:08] hmm [22:00:23] !log rebooting californium, silver, labcontrol1001, labservices1001 [22:01:55] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [22:02:34] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [22:04:37] chasemp yes that package.debian.org link is right, ruby-rgen is a dependency of the trusty puppet package (and just double checked on a trusty host) [22:05:06] and is installed [22:05:45] herron: yeah, I think it's possible from puppetlabs perspective ruby-rgen is no longer a dep and in practical terms may conflict [22:06:15] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [22:06:34] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [22:06:55] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.002 second response time [22:07:08] herron: I'm grasping at straws atm here :) wanted to get your perspective [22:07:15] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 
is OK: OK - nfs-exportd is active [22:09:15] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:09] 10Operations, 10ops-esams, 10netops: replace msw1-esams - https://phabricator.wikimedia.org/T185151#3907680 (10ayounsi) [22:26:37] 10Operations, 10Mail: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3907759 (10herron) Spent some time looking into this today. Sounds good overall. I'd like to roll these updates out in a phased way, and will need to split wikipedia.com into it's own dns zone file as it's cur... [22:34:43] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2861046 (10Iniquity) @Isarra we are waiting for T180817 this task. [22:38:44] hey - I'd like to scap deploy /srv/deployment/3d2png/deploy - can I do that now, or should I schedule it? [22:39:56] greg-g, ^ [22:41:01] matthiasmullie: there are services windows throughout the week [22:41:31] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907805 (10RobH) p:05Triage>03High [22:41:54] greg-g alright, I'll see if I can join that one tomorrow - thanks [22:41:59] matthiasmullie: what's the need? timeliness? [22:42:23] /urgency [22:42:35] not in a particular rush, I can wait :) [22:42:40] :) [22:44:15] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907847 (10RobH) Please note I've assigned this task to @faidon to review and approve, since he has been point person on this system's project. Ideally, once we have furud's existing 2 shelves,... [22:53:55] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:54:42] !log bootstrapping restbase1013-b - T184100 [22:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:55] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [22:58:55] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:38:57] !log [terbium:~] $ echo 'https://annual.wikimedia.org' | mwscript purgeList.php [23:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:35] (03PS1) 1020after4: Phabricator: Add translations library to phabricator profile [puppet] - 10https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) [23:48:03] (03PS6) 10Thcipriani: Scap canary: cache last good deploy time [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) [23:48:47] (03CR) 10Thcipriani: Scap canary: cache last good deploy time (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [23:50:39] (03CR) 1020after4: [C: 031] "This can be merged at any time, before or after tonight's phabricator deployment." [puppet] - 10https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 1020after4) [23:58:35] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [23:59:35] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time