[00:00:05] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T0000).
[00:00:05] <jouncebot>	 kaldari and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:08] <ebernhardson>	 \o
[00:01:31] <ebernhardson>	 i suppose i can ship things
[00:07:30] <wikibugs>	 (PS2) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[00:07:36] <wikibugs>	 Operations, monitoring, Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3904801 (ayounsi) a:ayounsi>None
[00:08:39] <wikibugs>	 Operations, monitoring, Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3890758 (ayounsi) About: >  Postgres DB was empty after I was able to have it restarted. No tables defined, but a netbox DB is defined. I re-ran the scap script...
[00:08:48] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.31.0-wmf.17/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T182616 Turn off cirrus AB test on hewiki (duration: 01m 14s)
[00:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:03] <stashbot>	 T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616
[00:10:18] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.31.0-wmf.16/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T182616 Turn off cirrus AB test on hewiki (duration: 01m 12s)
[00:10:25] <ebernhardson>	 kaldari: here for swat?
[00:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:55] <kaldari>	 here
[00:14:58] <kaldari>	 o/
[00:15:22] <wikibugs>	 (PS3) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[00:15:32] <wikibugs>	 (CR) EBernhardson: [C: 2] Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:15:35] <ebernhardson>	 kaldari: anything to test?
[00:15:41] * ebernhardson has no clue what this even does :P
[00:17:21] <greg-g>	 solid +2 :)
[00:17:54] <wikibugs>	 (Merged) jenkins-bot: Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:18:04] <wikibugs>	 (CR) jenkins-bot: Updating fonts list and sorting it [mediawiki-config] - https://gerrit.wikimedia.org/r/403984 (https://phabricator.wikimedia.org/T184664) (owner: Kaldari)
[00:18:11] <kaldari>	 ebernhardson: nope
[00:21:37] <logmsgbot>	 !log ebernhardson@tin Synchronized fc-list: SWAT: T184664 Updating fonts list and sorting it (duration: 01m 12s)
[00:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:51] <stashbot>	 T184664: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664
[00:22:01] <ebernhardson>	 kaldari: deployed
[00:22:34] <wikibugs>	 (Merged) jenkins-bot: Remove cirrus AB test config for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404592 (https://phabricator.wikimedia.org/T182616) (owner: EBernhardson)
[00:22:47] <wikibugs>	 (CR) jenkins-bot: Remove cirrus AB test config for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404592 (https://phabricator.wikimedia.org/T182616) (owner: EBernhardson)
[00:23:02] <mutante>	 isnt the test to render some specific SVG?:)
[00:23:29] <mutante>	 oh, you already did, saw ticket comment, cool
[00:25:09] <ebernhardson>	 turns out .. i didn't rebase before shipping that one :P one more time
[00:26:11] <logmsgbot>	 !log ebernhardson@tin Synchronized fc-list: SWAT: T184664 Updating fonts list and sorting it (duration: 01m 12s)
[00:26:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:51] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T182616 Remove cirrus AB test config for hewiki (duration: 01m 09s)
[00:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:05] <stashbot>	 T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model - https://phabricator.wikimedia.org/T182616
[00:29:07] <ebernhardson>	 swat complete
[00:45:59] <wikibugs>	 (PS4) Dzahn: DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)
[01:03:34] <wikibugs>	 (CR) Dzahn: [C: 2] DHCP: switch to http to retrieve installer image [puppet] - https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215) (owner: Dzahn)
[01:17:16] <wikibugs>	 (CR) Ayounsi: [C: 2] Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:17:22] <wikibugs>	 (PS3) Ayounsi: Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:17:46] <wikibugs>	 (CR) Ayounsi: [V: 2 C: 2] Add my new key [puppet] - https://gerrit.wikimedia.org/r/404604 (owner: MaxSem)
[01:24:08] <wikibugs>	 Operations, Patch-For-Review: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3905122 (Dzahn) Open>Resolved both things done , in separate changes, and mailed ops list about it.
[01:27:36] <wikibugs>	 (CR) Dzahn: "better add traffic team, i'm not a good reviewer for this" [puppet] - https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: Alex Monk)
[01:42:14] <wikibugs>	 Operations, Ops-Access-Requests: Requesting access to stat1004, stat1005, stat1006 for mneisler - https://phabricator.wikimedia.org/T184838#3905129 (MNeisler) I've confirmed with the team that I'll need access to the `analytics-privatedata-users` and `researchers` groups. I'll specifically need access to...
[01:58:31] <wikibugs>	 (PS10) Dzahn: Switch to YAML configuration for Parsoid on ruthenium [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[01:58:39] <wikibugs>	 (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/9753/ruthenium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[01:59:41] <wikibugs>	 (PS11) Dzahn: parsoid::testing:: Switch to YAML configuration  [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[02:02:42] <wikibugs>	 (PS1) MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337)
[02:03:42] <wikibugs>	 (CR) Dzahn: "applied on ruthenium now" [puppet] - https://gerrit.wikimedia.org/r/403464 (owner: Arlolra)
[02:31:06] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.16) (duration: 07m 11s)
[02:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:37] <wikibugs>	 Operations, Research, Patch-For-Review, Research-2017-18-Q2: Permissions to upload data to the analytics cluster from a machine at Drexel - https://phabricator.wikimedia.org/T177521#3905172 (DarTar) Open>Resolved
[03:21:10] <wikibugs>	 (PS47) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[03:37:46] <wikibugs>	 (PS1) TerraCodes: Move flaggedrevs to NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603)
[04:26:52] <wikibugs>	 (Draft2) Jayprakash12345: Add Draft Namespace in enwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957)
[04:27:24] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:28:45] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[04:39:24] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:40:45] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
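(Editor's note: the icinga-wm alerts report the share of recent datapoints above a threshold. The percentages seen in this log, 11.11%, 22.22% and 33.33%, are consistent with 1, 2 and 3 datapoints out of a window of 9. The window size and the sample values below are assumptions for illustration only; they are not taken from the log or from the actual check's configuration.)

```python
def percent_above(datapoints, threshold):
    """Return the percentage of datapoints strictly above the threshold."""
    if not datapoints:
        raise ValueError("no datapoints to evaluate")
    over = sum(1 for value in datapoints if value > threshold)
    return 100.0 * over / len(datapoints)


# Hypothetical window of nine 5xx reqs/min samples (invented values).
window = [120.0, 300.0, 1500.0, 2200.0, 400.0, 90.0, 80.0, 60.0, 50.0]

# Two of nine samples exceed the critical threshold of 1000.0.
pct = percent_above(window, 1000.0)
print(f"{pct:.2f}% of data above the critical threshold [1000.0]")
```

(With these invented samples the printed figure matches the 22.22% in the first Ulsfo alert, which is the only point of the sketch.)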
[04:52:54] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.24 seconds
[04:53:34] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.59 seconds
[04:53:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.93 seconds
[04:53:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.88 seconds
[04:53:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.52 seconds
[04:53:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.54 seconds
[04:53:44] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.79 seconds
[05:00:05] <icinga-wm>	 PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[05:02:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 57.13 seconds
[05:02:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 56.24 seconds
[05:02:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 53.40 seconds
[05:02:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2018 is OK: OK slave_sql_lag Replication lag: 48.91 seconds
[05:02:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 47.94 seconds
[05:02:44] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 33.53 seconds
[05:02:54] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 18.45 seconds
[06:14:14] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569)
[06:16:52] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:18:18] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:18:28] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/404629 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:20:42] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 - T174569 (duration: 01m 14s)
[06:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:57] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:21:07] <marostegui>	 !log Upgrade mariadb and kernel on db1104
[06:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:02] <marostegui>	 !log Deploy schema change on db1104 - T174569
[06:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:15] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:29:31] <marostegui>	 !log Stop replication in sync on db1089 and s1 codfw master (db2048) - T162807
[06:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:42] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[06:30:04] <icinga-wm>	 RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:40:03] <wikibugs>	 Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905339 (Marostegui)
[06:40:21] <marostegui>	 !log Stop MySQL on labsdb1001 (already dead) and labsdb1003 - T184832
[06:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:32] <stashbot>	 T184832: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832
[06:40:55] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:42:09] <wikibugs>	 (PS4) Marostegui: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[06:42:34] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:42:54] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:43:28] <wikibugs>	 Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905342 (Marostegui)
[06:43:32] <wikibugs>	 (CR) Marostegui: [C: 2] mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[06:46:38] <wikibugs>	 Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905344 (Marostegui)
[06:47:18] <marostegui>	 !log Remove labsdb1001 and labsdb1003 from tendril - T184832
[06:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:30] <stashbot>	 T184832: Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832
[06:49:47] <wikibugs>	 Operations, ops-eqiad, hardware-requests, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3905347 (Marostegui) a:Cmjohnson I believe this is now ready for @Cmjohnson to proceed.
[07:45:48] <_joe_>	 !log depooling mw1209-1220 from the appserver cluster for decommissioning, T185004
[07:46:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:02] <stashbot>	 T185004: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004
[07:46:33] <logmsgbot>	 !log oblivian@neodymium conftool action : set/pooled=inactive; selector: cluster=appserver,name=mw12([0-1][0-9]|20)\.eqiad\.wmnet
[07:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
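(Editor's note: a minimal sketch of which host names the selector regex in the conftool line above would select. The pattern is copied from the log. The candidate host list is hypothetical, and the sketch assumes the pattern is matched against the full host name, which the log does not state.)

```python
import re

# Selector pattern copied verbatim from the conftool log line.
SELECTOR = re.compile(r"mw12([0-1][0-9]|20)\.eqiad\.wmnet")


def selected_hosts(candidates):
    """Return the candidates whose whole name matches the selector."""
    return [host for host in candidates if SELECTOR.fullmatch(host)]


# Hypothetical candidates mw1195..mw1225; not an inventory from the log.
candidates = [f"mw{n}.eqiad.wmnet" for n in range(1195, 1226)]

matched = selected_hosts(candidates)
print(matched[0], matched[-1], len(matched))
```

(Under these assumptions the pattern covers mw1200 through mw1220. That includes mw1200, which lies outside the mw1201-mw1220 range named in T185004; the log does not say whether such a host existed in the appserver cluster.)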
[07:48:31] <elukey>	 !log restart varnish backend on cp4024 (ton of 503s, icinga alerting for mailbox lag)
[07:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:13] <wikibugs>	 (CR) Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row (6 comments) [puppet] - https://gerrit.wikimedia.org/r/404498 (owner: Giuseppe Lavagetto)
[07:54:29] <wikibugs>	 (CR) Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[07:55:07] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1215 is CRITICAL: Host mw1215 is not in mediawiki-installation dsh group
[07:56:12] <wikibugs>	 (PS3) Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453
[07:56:14] <wikibugs>	 (PS2) Giuseppe Lavagetto: site.pp: decommission mw1201-1208 [puppet] - https://gerrit.wikimedia.org/r/404499 (https://phabricator.wikimedia.org/T185004)
[07:56:16] <wikibugs>	 (PS2) Giuseppe Lavagetto: site.pp: decommission mw1209-1220 [puppet] - https://gerrit.wikimedia.org/r/404500 (https://phabricator.wikimedia.org/T185004)
[07:56:18] <wikibugs>	 (PS2) Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row [puppet] - https://gerrit.wikimedia.org/r/404498
[07:58:51] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
[07:59:41] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:59:50] <wikibugs>	 (CR) Elukey: [C: 1] site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[08:00:01] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:03:32] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1209 is CRITICAL: Host mw1209 is not in mediawiki-installation dsh group
[08:04:12] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1216 is CRITICAL: Host mw1216 is not in mediawiki-installation dsh group
[08:05:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] site.pp: reorganize appservers in eqiad by function/row [puppet] - https://gerrit.wikimedia.org/r/404453 (owner: Giuseppe Lavagetto)
[08:11:31] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1220 is CRITICAL: Host mw1220 is not in mediawiki-installation dsh group
[08:12:12] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1210 is CRITICAL: Host mw1210 is not in mediawiki-installation dsh group
[08:12:12] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1219 is CRITICAL: Host mw1219 is not in mediawiki-installation dsh group
[08:13:37] <elukey>	 these are expected, part of decom --^
[08:15:08] <wikibugs>	 Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3905409 (Joe)
[08:18:53] <_joe_>	 yeah but I downtimed the hosts...
[08:19:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] site.pp: decommission mw1201-1208 [puppet] - https://gerrit.wikimedia.org/r/404499 (https://phabricator.wikimedia.org/T185004) (owner: Giuseppe Lavagetto)
[08:30:11] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] site.pp: decommission mw1209-1220 [puppet] - https://gerrit.wikimedia.org/r/404500 (https://phabricator.wikimedia.org/T185004) (owner: Giuseppe Lavagetto)
[08:36:07] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569)
[08:37:47] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:10] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:20] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Slowly repool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/404635 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:40:28] <wikibugs>	 Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3905423 (Joe)
[08:40:42] <wikibugs>	 Operations, hardware-requests, HHVM, Patch-For-Review, User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902917 (Joe) a:Joe>None
[08:42:30] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=61%)
[08:42:35] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807)
[08:44:14] <elukey>	 !log reboot stat100[456] for kernel upgrades
[08:44:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:14] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1101:3318 (duration: 15m 42s)
[08:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:11] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:58:43] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:58:53] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1065, pool db1067 for vslow [mediawiki-config] - https://gerrit.wikimedia.org/r/404636 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[09:00:31] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 01m 11s)
[09:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:43] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:03:35] <wikibugs>	 (PS1) Filippo Giunchedi: restbase: reprovision restbase1016 [puppet] - https://gerrit.wikimedia.org/r/404638 (https://phabricator.wikimedia.org/T184100)
[09:04:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] restbase: reprovision restbase1016 [puppet] - https://gerrit.wikimedia.org/r/404638 (https://phabricator.wikimedia.org/T184100) (owner: Filippo Giunchedi)
[09:06:05] <elukey>	 !log reboot analytics1003 for kernel upgrades
[09:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:57] <wikibugs>	 (PS2) Gehel: wdqs: simplify logging of categories reload [puppet] - https://gerrit.wikimedia.org/r/404315
[09:11:27] <wikibugs>	 Operations, Developer-Relations, Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3905487 (Qgil)
[09:11:31] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Increase traffic for  db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639
[09:11:39] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] Postgres: remove hardcoded version [puppet] - https://gerrit.wikimedia.org/r/404516 (https://phabricator.wikimedia.org/T184634) (owner: Ayounsi)
[09:12:29] <icinga-wm>	 PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[init_superset]
[09:13:54] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for  db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:14:36] <godog>	 !log reimage restbase1016 - T184100
[09:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:47] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[09:15:53] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Increase traffic for  db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:16:51] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Increase traffic for  db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404639 (owner: Marostegui)
[09:17:29] <icinga-wm>	 RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:17:34] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1101:3318 (duration: 01m 12s)
[09:17:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:29] <icinga-wm>	 PROBLEM - DPKG on furud is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:25:16] <elukey>	 thorium should be fixed
[09:25:28] <icinga-wm>	 RECOVERY - DPKG on furud is OK: All packages OK
[09:28:45] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642
[09:30:29] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:30:34] <moritzm>	 !log rebooting flerovium and furud for kernel security update
[09:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:05] <wikibugs>	 Operations, MediaWiki-Configuration, discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3905518 (Joe) p:Triage>Normal
[09:32:01] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:32:13] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Fully repool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/404642 (owner: Marostegui)
[09:34:22] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Full repool db1101:3318 (duration: 01m 11s)
[09:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:20] <wikibugs>	 Operations, ops-eqiad: mw1191 ipmi-sel cpu errors - https://phabricator.wikimedia.org/T179640#3905531 (elukey) Open>Resolved a:elukey Host decommed in https://phabricator.wikimedia.org/T183895
[09:38:16] <wikibugs>	 Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3905536 (fgiunchedi) I'm running into this issue again when reimaging restbase systems as part of {T184100}. From the comments above it seems that setting rootdelay= fixes the issue but we're not applyin...
[09:46:04] <elukey>	 !log removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?)
[09:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:21] <elukey>	 Cc: Krinkle,ottomata --^
[09:52:13] <elukey>	 !log reboot druid1005 for kernel upgrades
[09:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:40] <elukey>	 uff it is druid1002, amending
[09:59:53] <wikibugs>	 Operations, MediaWiki-Configuration, discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3905554 (Volans)
[10:00:41] <wikibugs>	 Operations, MediaWiki-Configuration, discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3905561 (Volans)
[10:01:05] <godog>	 !log start cassandra-a on restbase1016
[10:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:01] <wikibugs>	 Operations, MediaWiki-Configuration, discovery-system: Prepare conftool for safely editing mediawiki-config values - https://phabricator.wikimedia.org/T185080#3905564 (Joe) p:Triage>Normal
[10:11:54] <moritzm>	 !log reset RAC on hydrogen, serial console was inaccessible
[10:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:28] <moritzm>	 !log depooling hydrogen (and keeping pdns-recursor stopped for a few minutes to check whether problems with load-balanced recdns traffic are still an issue)
[10:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:43] <wikibugs>	 (PS1) Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/404647
[10:19:05] <wikibugs>	 Operations, DBA, MediaWiki-Configuration, discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (Joe) p:Triage>Normal
[10:19:50] <wikibugs>	 Operations, DBA, MediaWiki-Configuration, discovery-system: Allow use of EtcdConfig to configure slave databases - https://phabricator.wikimedia.org/T185084#3905634 (Joe)
[10:22:51] <moritzm>	 !log repooling hydrogen (and pdns-recursor restarted), experiment concluded
[10:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:11] <wikibugs>	 (PS1) Muehlenhoff: Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648
[10:26:55] <wikibugs>	 (CR) Ema: [C: 1] Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648 (owner: Muehlenhoff)
[10:27:20] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove hydrogen from LVS name server config [puppet] - https://gerrit.wikimedia.org/r/404648 (owner: Muehlenhoff)
[10:34:57] <wikibugs>	 (PS16) Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647)
[10:35:00] <moritzm>	 !log depooling hydrogen again
[10:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:17] <moritzm>	 !log rebooting hydrogen for kernel security update
[10:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:49] <moritzm>	 !log repooling hydrogen
[10:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:12] <wikibugs>	 (03PS1) 10Ema: Revert "Remove hydrogen from LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/404649
[10:44:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Remove hydrogen from LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/404650
[10:44:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Revert "Remove hydrogen from LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/404649 (owner: 10Ema)
[10:44:36] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: grafana: Enable grafana's LDAP [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[10:44:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[10:44:42] <wikibugs>	 (03CR) 10Ema: [C: 032] Revert "Remove hydrogen from LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/404649 (owner: 10Ema)
[10:44:44] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Revert "Remove hydrogen from LVS name server config" [puppet] - 10https://gerrit.wikimedia.org/r/404650 (owner: 10Muehlenhoff)
[10:45:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] grafana: Enable grafana's LDAP [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[10:51:04] <moritzm>	 !log reset RAC on chromium, serial console is inaccessible
[10:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:13] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: grafana: Enable grafana's LDAP [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[10:56:45] <wikibugs>	 10Operations, 10Traffic, 10media-storage, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3905720 (10fgiunchedi)
[11:03:24] <wikibugs>	 (03PS2) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647
[11:07:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Looks correct puppet-wise, but I dislike the fact we're basically writing the grafana configurations in hiera directly and there is no val" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:07:57] <wikibugs>	 (03PS3) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647
[11:09:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:09:49] <wikibugs>	 (03CR) 10Volans: "LGTM if this is a one-off script. But in this case do we really need it added into Puppet?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:10:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:11:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:11:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[11:18:42] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time
[11:19:00] <_joe_>	 !log restarted nginx on mw1346, was in a bad state
[11:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:31] <moritzm>	 !log rebooting neodymium for kernel security update
[11:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:30] <moritzm>	 !log rearmed keyholder on neodymium
[11:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:40] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2720 MB (3% inode=97%): /srv 429831 MB (6% inode=93%)
[11:37:09] <elukey>	 yeah we are aware of this --^
[11:37:27] <elukey>	 going to work on it this afternoon
[11:40:19] <godog>	 !log bootstrap cassandra-b on restbase1016
[11:40:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100)
[11:48:32] <wikibugs>	 (03PS1) 10Aklapper: Allow discourse-mediawiki.wmflabs.org RSS feed on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087)
[11:51:33] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[11:56:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "It is indeed a once-off. I 've uploaded it in puppet, mostly to get reviews for it (thanks btw!) and have people aware of what is going to" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[12:00:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Yes, I dislike that too, but it's the status quo for this module and my current aim is not to challenge it, but rather do T170150." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[12:07:25] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[12:07:27] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[12:10:27] <moritzm>	 !log updating HHVM in deployment-prep to 3.18.5+wmf4
[12:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:39] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3905851 (10jcrespo)
[12:16:42] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3905852 (10jcrespo)
[12:20:27] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris)
[12:28:18] <moritzm>	 !log uploading HHVM 3.18.5+wmf4 for jessie-wikimedia to apt.wikimedia.org (3.18.7 with the patch https://github.com/facebook/hhvm/commit/bd7b2bcfe70b053a3a001480653012f68599250f backed out)
[12:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:51] <wikibugs>	 (03CR) 10Mark Bergsma: "A few minor nitpicks, otherwise good to go." (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[12:35:57] <wikibugs>	 (03PS4) 10Mark Bergsma: Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069)
[12:35:59] <wikibugs>	 (03PS8) 10Mark Bergsma: Support per-service-IP BGP MED values [debs/pybal] - 10https://gerrit.wikimedia.org/r/393097 (https://phabricator.wikimedia.org/T165764)
[12:38:00] <wikibugs>	 (03CR) 10Mark Bergsma: [C: 032] Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) (owner: 10Mark Bergsma)
[12:38:26] <wikibugs>	 (03Merged) 10jenkins-bot: Support multiple BGP peerings [debs/pybal] - 10https://gerrit.wikimedia.org/r/393066 (https://phabricator.wikimedia.org/T180069) (owner: 10Mark Bergsma)
[12:48:52] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 031] "I still find passing verbose around to functions intended to upgrade packages etc. to be a bit icky, but OK, +1 as far as I'm concerned :)" [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[12:53:30] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[12:58:34] <wikibugs>	 (03PS4) 10Faidon Liambotis: Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki)
[12:59:13] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/402583 (https://phabricator.wikimedia.org/T184338) (owner: 10Framawiki)
[13:02:15] <wikibugs>	 10Operations, 10Patch-For-Review: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo - https://phabricator.wikimedia.org/T184338#3905931 (10faidon) 05Open>03Resolved Merged -- thanks :)
[13:09:21] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3905953 (10hashar)
[13:12:57] <marostegui>	 !log Fixing drifts on db1065 - T162807
[13:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:08] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[13:15:47] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569)
[13:17:11] <moritzm>	 !log upgrading app server canaries to 3.18.5+wmf4
[13:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[13:20:28] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[13:20:38] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404661 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui)
[13:20:54] <wikibugs>	 (03PS4) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647
[13:21:38] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3905976 (10chasemp)
[13:22:13] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 - T174569 (duration: 01m 12s)
[13:22:22] <marostegui>	 !log Deploy schema change on db1099:3318 - https://phabricator.wikimedia.org/T174569
[13:22:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:25] <stashbot>	 T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[13:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:51] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp)
[13:26:49] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906000 (10chasemp)
[13:31:36] <akosiaris>	 !log reboot acrab for PCID,INVPCID enabling
[13:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:25] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3906004 (10MoritzMuehlenhoff) I've built/uploaded new HHVM packages for jessie (stretch following soon) which disable the broken patc...
[13:32:49] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664
[13:32:50] <icinga-wm>	 PROBLEM - Host acrab is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:51] <icinga-wm>	 PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[13:34:10] <icinga-wm>	 RECOVERY - Host acrab is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms
[13:34:45] <wikibugs>	 10Operations, 10cloud-services-team: Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#3906007 (10chasemp) p:05Triage>03High
[13:35:24] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906023 (10chasemp)
[13:36:25] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui)
[13:37:57] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui)
[13:38:07] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404664 (owner: 10Marostegui)
[13:39:26] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 (duration: 01m 13s)
[13:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:35] <chasemp>	 !log labstore2001:~# /sbin/reboot
[13:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:50] <chasemp>	 !log labstore2002:~# sudo update-grub && /sbin/reboot
[13:45:59] <akosiaris>	 !log reboot sca2003 webperf2001 planet2001 poolcounter2002 mx2001 kubetcd200{1,2,3} install2002 dbmonitor2001 alsafi acrux hassaleh diadem nihal pybal-test200{1,2,3} releases2001 tureis for PCID, INVPCID 
[13:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:19] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "After looking at it, this is wrong- mariadb maintenace must be kept, including currently no tasks (we will include a checksum soon). the t" [puppet] - 10https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: 10Dzahn)
[13:50:44] <wikibugs>	 (03PS1) 10Ema: eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672
[13:51:26] <chasemp>	 !log reboot labstore2003
[13:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:51] <icinga-wm>	 PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:00] <icinga-wm>	 PROBLEM - puppet last run on mw2124 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:02] <Krinkle>	 elukey: Thx for the notif
[13:52:11] <icinga-wm>	 PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:20] <icinga-wm>	 PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:30] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:40] <icinga-wm>	 PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:41] <icinga-wm>	 PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:52:51] <icinga-wm>	 PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:01] <icinga-wm>	 PROBLEM - puppet last run on kubetcd2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:11] <icinga-wm>	 PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:11] <icinga-wm>	 PROBLEM - etc request latencies on acrux is CRITICAL: CRITICAL - etcd_request_latencies is 68357 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:53:17] <volans>	 akosiaris: did you named puppetdb today too?
[13:53:44] <akosiaris>	 volans: yes. I rebooted nihal for the kernel upgrades
[13:53:51] <icinga-wm>	 PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:53:51] <icinga-wm>	 PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:54:11] <icinga-wm>	 RECOVERY - etc request latencies on acrux is OK: OK - etcd_request_latencies is 2536 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:54:11] <volans>	 ack
[13:54:13] <akosiaris>	 the more interesting thing (which is something I wanted to test) is the etc request latencies on acrux thing
[13:54:37] <akosiaris>	 I force rebooted the etcd cluster without waiting much, on purpose
[13:54:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672 (owner: 10Ema)
[13:55:09] <akosiaris>	 and yes it has recovered 
[13:55:16] <akosiaris>	 but I like that it alerted
[13:55:18] <akosiaris>	 that's nice
[13:55:20] <icinga-wm>	 PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:21] <icinga-wm>	 PROBLEM - puppet last run on elastic2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:21] <icinga-wm>	 PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:21] <volans>	 yeah
[13:55:21] <icinga-wm>	 PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:30] <icinga-wm>	 PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:30] <icinga-wm>	 PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:41] <icinga-wm>	 PROBLEM - puppet last run on acrux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:56:00] <icinga-wm>	 PROBLEM - puppet last run on db2084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:56:19] <akosiaris>	 cluster is healthy
[13:56:20] <akosiaris>	 nice
[13:56:20] <icinga-wm>	 PROBLEM - puppet last run on mc2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:56:20] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:01] <icinga-wm>	 PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:57:40] <elukey>	 akosiaris: ignorant question - why does the oom killer act every time on nitrogen even if there are (potentially) pages from the page-cache to reclaim? I can't see anything weird from the puppetdb, except the puppetdb's jvm crossing a certain threshold of committed memory (but not thrashing afaics, or better, not changing its GC behavior much)
[13:58:42] <wikibugs>	 (03CR) 10Rush: [C: 031] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[13:58:54] <icinga-wm>	 RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:59:08] <akosiaris>	 elukey: that's exactly my question. I am waiting to see what will happen next time
[13:59:37] <akosiaris>	 I can't understand how suddenly the VM is at top vm memory usage and OOM shows up
[14:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1400).
[14:00:04] <jouncebot>	 Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:17] <zeljkof>	 I can SWAT today
[14:00:22] <Jayprakash12345>	 i am here
[14:00:44] <icinga-wm>	 RECOVERY - puppet last run on acrux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:01:04] <volans>	 akosiaris, elukey: out of the last 9 OOM killer events in dmesg, 8 of them coincide with minutes 26 and 56 of the hour, when puppet runs at the same time on both Icinga hosts
[14:01:14] <volans>	 can we try to force tegmen at a different time for some days
[14:01:14] <zeljkof>	 Jayprakash12345: I will let you know when the first patch is at mwdebug, in a few minutes
[14:01:19] <volans>	 and see if we still repro?
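A minimal sketch of the kind of check volans describes: count how many OOM-killer timestamps fall on, or within a minute of, the puppet cron minutes. The function name, the timestamp format, the one-minute tolerance, and the sample timestamps are all assumptions made for illustration; none of them come from the hosts discussed here.

```python
from datetime import datetime


def count_coinciding(timestamps, cron_minutes, tolerance=1):
    """Count timestamps whose minute lies within `tolerance` of a cron minute."""
    hits = 0
    for ts in timestamps:
        minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").minute
        # Distance is measured around the clock face, so minute 59 is 1 away from 0.
        if any(min((minute - m) % 60, (m - minute) % 60) <= tolerance
               for m in cron_minutes):
            hits += 1
    return hits


# Hypothetical timestamps, invented for this example (not real dmesg output).
oom_events = [
    "2018-01-15 03:26:41",
    "2018-01-15 19:57:02",
    "2018-01-16 08:56:30",
    "2018-01-16 14:10:05",
]

hits = count_coinciding(oom_events, cron_minutes=(26, 56))
print(f"{hits} of {len(oom_events)} OOM kills fall near a puppet run")
```

A high ratio only shows correlation with the cron schedule; moving one host to a different minute, as suggested in the conversation, is what would test whether the simultaneous runs are the trigger.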
[14:01:25] <elukey>	 volans: sure let's do it
[14:01:36] <elukey>	 is nitrogen a VM on ganeti?
[14:01:44] <Jayprakash12345>	 zeljkof: ok :)
[14:01:50] <volans>	 elukey: yes
[14:01:57] <elukey>	 ahhh nice didn't know it
[14:02:00] <akosiaris>	 volans: a nice finding
[14:02:11] <akosiaris>	 so yeah that might trigger it
[14:02:40] <volans>	 it might just be an indicator of overload, and actually it was j.oe who said to have a look at icinga, which is heavy on the puppetdb
[14:02:47] <volans>	 and I found that they have the same crontab :(
[14:03:04] <volans>	 and given it is on a VM... maybe the new kernel doesn't help
[14:03:06] <wikibugs>	 (03CR) 10Ema: [C: 032] eqiad: temporarily remove chromium from LVS nameservers [puppet] - 10https://gerrit.wikimedia.org/r/404672 (owner: 10Ema)
[14:03:14] <elukey>	 this only happens though when the jvm reaches the ~6g committed memory
[14:03:16] <volans>	 and we "overload" easier now
[14:03:17] <elukey>	 not before
[14:03:37] <moritzm>	 !log depooling chromium
[14:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:19] <elukey>	 volans: what do you mean with "overload" in this case? Memory pressure ?
[14:04:27] <gehel>	 !log restart of elasticsearch / cirrus eqiad completed (cluster still recovering)
[14:04:29] <volans>	 elukey: we should monitor postgres memory too at the same times
[14:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:59] <elukey>	 volans: yeah this is another suspicion that I have - 6g for puppetdb + a spike for postgres == oom acting
[14:05:18] <elukey>	 but maybe it could work to simply tell it to drop page cache a bit
[14:05:33] <elukey>	 rather than being so harsh :D
[14:05:39] <volans>	 eheheh
[14:05:54] <volans>	 vm.swappiness = 0 too
[14:06:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[14:06:21] <wikibugs>	 (03PS17) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647)
[14:06:23] <elukey>	 ah nice finding
[14:06:24] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[14:06:33] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345)
[14:06:44] <elukey>	 we have 1g of swap in there
[14:06:57] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3906052 (10akosiaris) Scheduling this for February 12th 2018, say 10:00 am UTC. I 'll run a few more tests and then send an informational m...
[14:07:04] <volans>	 I was reading the other day https://chrisdown.name/2018/01/02/in-defence-of-swap.html 
[14:07:40] <volans>	 and it reminded me that vm.swappiness = 0 triggers some specific behaviour of the kernel regarding memory reclaim
[14:07:44] <moritzm>	 !log rebooting chromium for kernel security update
[14:07:48] <akosiaris>	 yeah me too... note that he has no numbers however
[14:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add Draft Namespace in enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345)
[14:08:09] <volans>	 akosiaris: no numbers?
[14:08:31] <wikibugs>	 (03CR) 10jenkins-bot: Add Draft Namespace in enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404624 (https://phabricator.wikimedia.org/T184957) (owner: 10Jayprakash12345)
[14:08:33] <akosiaris>	 volans: yeah the entire post is kind of academic
[14:08:51] <volans>	 ah yeah, no benchmarks, indeed, and it didn't convince me completely
[14:08:51] <wikibugs>	 (03PS3) 10Zfilipin: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345)
[14:09:16] <akosiaris>	 neither was I convinced entirely. I did keep a mental note to revisit the issue at some point
[14:09:24] <icinga-wm>	 PROBLEM - DPKG on es2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:10:11] <volans>	 but I thought that in some cases it might be worth a test either with swappiness = 1 or with the cgroup stuff
[14:10:20] <volans>	 just speculation for now ;)
[14:10:24] <icinga-wm>	 RECOVERY - DPKG on es2001 is OK: All packages OK
[14:10:25] <akosiaris>	 agreed
[14:10:41] <elukey>	 volans: how can we change the puppet run times for the icinga hosts?
[14:10:58] <akosiaris>	 I don't think you really can
[14:10:58] <volans>	 dunno, I asked you and you told me that we could force it :D
[14:11:10] <volans>	 I know it's the hash of the host
[14:11:19] <volans>	 so not sure if in hiera we can override
[14:11:21] <volans>	 didn't check the code
[14:11:27] <wikibugs>	 (03PS1) 10Ema: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674
[14:12:00] <Jayprakash12345>	 zeljkof: [config] 404624 Add Draft Namespace in enwikiversity (T184957) working good.
[14:12:01] <stashbot>	 T184957: en:wikiversity Draft Namespace - https://phabricator.wikimedia.org/T184957
[14:12:01] <zeljkof>	 Jayprakash12345: 404624 is at mwdebug1002, please test and let me know if I can deploy
[14:12:12] <zeljkof>	 Jayprakash12345: ok to deploy?
[14:12:13] <Jayprakash12345>	 zeljkof: deploy
[14:12:21] <zeljkof>	 ok, deploying...
[14:12:21] <akosiaris>	 elukey: it's in base::puppet
[14:12:25] <volans>	 we could also just disable puppet on tegmen for a couple of days, so far it happened every 1~2 days
[14:12:48] <akosiaris>	 elukey: but $crontime is calculated in the puppet.cron.erb file
[14:13:12] <akosiaris>	 well, the $times variable is, not $crontime
[14:13:19] <wikibugs>	 (03PS2) 10Ema: Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674
[14:13:27] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] Revert "eqiad: temporarily remove chromium from LVS nameservers" [puppet] - 10https://gerrit.wikimedia.org/r/404674 (owner: 10Ema)
[14:13:55] <akosiaris>	 $crontime = fqdn_rand(60, 'puppet-params-crontime')
[14:14:17] <akosiaris>	 heh.. not very configurable 
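A rough sketch of why a hostname-derived cron minute is hard to override, and how two hosts can end up on the same schedule. This does not reproduce Puppet's actual fqdn_rand output: the MD5-modulo hashing, the function names, the example hostname, and the twice-hourly interval (inferred from the "26,56" minutes mentioned earlier) are all assumptions for illustration.

```python
import hashlib


def host_minute(fqdn, seed, max_value=60):
    """Derive a stable pseudo-random minute from a hostname and a seed string.

    Illustrative stand-in only; Puppet's fqdn_rand uses a different algorithm.
    """
    digest = hashlib.md5(f"{fqdn}:{seed}".encode()).hexdigest()
    return int(digest, 16) % max_value


def run_minutes(crontime, interval=30):
    """Expand a base minute into every run minute within the hour."""
    return sorted({(crontime + k * interval) % 60 for k in range(60 // interval)})


# Hypothetical hostname; the seed string is the one quoted in the conversation.
base = host_minute("host-a.example.org", "puppet-params-crontime")
print(base, run_minutes(base))

# Base minutes 26 and 56 expand to the same twice-hourly schedule.
print(run_minutes(26), run_minutes(56))
```

Because the minute is a pure function of the hostname and a fixed seed, changing it for one host means changing the code or adding a parameter, which matches the "not very configurable" remark.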
[14:14:26] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404624|Add Draft Namespace in enwikiversity (T184957)]] (duration: 01m 12s)
[14:14:29] <moritzm>	 !log repooling chromium
[14:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:38] <zeljkof>	 Jayprakash12345: deployed, please check
[14:14:41] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906059 (10chasemp)
[14:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:24] <icinga-wm>	 RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:16:02] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345)
[14:17:27] <wikibugs>	 (03Merged) 10jenkins-bot: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345)
[14:17:37] <wikibugs>	 (03CR) 10jenkins-bot: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) (owner: 10Jayprakash12345)
[14:18:30] <zeljkof>	 Jayprakash12345: 404327 is at mwdebug1002, please test and let me know if I can deploy
[14:18:37] <Jayprakash12345>	 zeljkof: ok
[14:18:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100)
[14:18:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100)
[14:20:23] <icinga-wm>	 RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[14:20:24] <icinga-wm>	 RECOVERY - puppet last run on elastic2029 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[14:20:24] <icinga-wm>	 RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[14:20:33] <icinga-wm>	 RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[14:20:55] <icinga-wm>	 RECOVERY - puppet last run on db2084 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[14:21:23] <icinga-wm>	 RECOVERY - puppet last run on mc2033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:21:23] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[14:21:53] <icinga-wm>	 RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:03] <icinga-wm>	 RECOVERY - puppet last run on mw2124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:03] <icinga-wm>	 RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:22:13] <icinga-wm>	 RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:23] <icinga-wm>	 RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:22:25] <Jayprakash12345>	 zeljkof: ok, deploy
[14:22:32] <zeljkof>	 Jayprakash12345: deploying...
[14:22:33] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:43] <icinga-wm>	 RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:22:43] <icinga-wm>	 RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:22:53] <icinga-wm>	 RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:23:04] <icinga-wm>	 RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:23:14] <icinga-wm>	 RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:23:47] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404327|Create "eliminator" user group on ur.wikipedia (T184607)]] (duration: 01m 12s)
[14:23:53] <icinga-wm>	 RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:23:54] <icinga-wm>	 RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:59] <stashbot>	 T184607: Create "eliminator" user group on ur.wikipedia - https://phabricator.wikimedia.org/T184607
[14:24:13] <zeljkof>	 Jayprakash12345: deployed, please check and thanks for deploying with #releng ;)
[14:25:33] <icinga-wm>	 RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:25:46] <Jayprakash12345>	 zeljkof: Checked, Thanks for being here.
[14:27:05] <wikibugs>	 10Operations, 10cloud-services-team: labstore2003 reboots into mode missing /srv disks - https://phabricator.wikimedia.org/T185102#3906073 (10chasemp) p:05Triage>03High
[14:27:13] <zeljkof>	 !log EU SWAT finished
[14:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:09] <wikibugs>	 (03PS5) 10Ema: Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715)
[14:29:40] <wikibugs>	 (03CR) 10Ema: Use up-and-enabled servers in can-depool logic (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[14:29:55] <wikibugs>	 (03CR) 10Volans: "Much better, thanks for improving it. Still some fixes needed, see inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[14:30:53] <wikibugs>	 10Operations: Stack overflow on beta cluster API interaction - https://phabricator.wikimedia.org/T185103#3906085 (10Niedzielski)
[14:38:29] <chasemp>	 !log labstore1001:~# /sbin/reboot
[14:38:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:58] <wikibugs>	 (03CR) 10Ema: [C: 032] Use up-and-enabled servers in can-depool logic [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[14:42:42] <wikibugs>	 (03CR) 10Volans: "A couple of questions inline" (033 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/403677 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[14:43:38] <wikibugs>	 (03CR) 10Arlolra: "Thanks.  I ran update_parsoid.sh on ruthenium and so far so good." [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra)
[14:43:56] * volans 2 over 2 CR merged while reviewing... need to restart internal NTPd
[14:44:15] <ema>	 :)
[14:47:56] <ema>	 volans: the threshold is set in setUp()  
[14:48:33] <volans>	 ema: ah right, missed that, thx
[14:48:55] <volans>	 I assumed it was 50% anyway ;)
[14:49:41] <ema>	 volans: if I understand your other question correctly, no, we don't try to depool servers which are fine according to pybal's monitoring
[14:50:30] <volans>	 ema: what I was trying to ask is, is canDepool() independent of the server you want to depool?
[14:50:44] <ema>	 volans: yes
[14:50:58] <volans>	 shouldn't it instead be relative to it?
[14:51:13] <volans>	 depending on the situation I might be able to depool *that* server, but not another one
[14:51:37] <ema>	 canDepool is only called upon monitoring changes, not in case of admin actions
[14:51:50] <volans>	 ok
[14:52:50] <ema>	 so it doesn't cover the case where an admin shoots herself in the foot by depooling the only host serving traffic
[14:53:04] <mark>	 btw, not sure I agree with sum() being cleaner
[14:53:11] <mark>	 sum to me suggests summing integers, not counting
[14:53:15] <moritzm>	 !log resetting RAC on labsdb1006 (serial console inaccessible)
[14:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:56] <volans>	 mark: it's personal, no strong feeling, I just thought it slightly unnecessary to create another list just to count its size
[14:54:04] <volans>	 and not use it for anything else
[14:54:32] <volans>	 the sum already returns the result needed in this specific case
[14:54:40] <volans>	 and is slightly shorter :)
[14:54:59] <mark>	 i guess
[14:55:04] * ema agrees with volans but blissfully ignores his comment
[14:55:09] <mark>	 there's a whole lot of that in pybal btw
[14:55:18] <mark>	 especially since a lot of it predates these features in python hehe
[14:55:44] <icinga-wm>	 PROBLEM - DPKG on es2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:55:53] <mark>	 i wanted to mention something similar with a set comprehension earlier in ema's code
[14:55:58] <mark>	 and then noticed pybal used a dict
[14:56:05] <mark>	 iirc, because pybal didn't have sets yet I abused a dict ;)
[14:56:08] <mark>	 python
[14:56:23] <wikibugs>	 (03PS1) 10Ema: Use up-and-enabled servers in can-depool logic [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404680 (https://phabricator.wikimedia.org/T184715)
[14:56:36] <volans>	 sets were added in 2.3
[14:56:54] <mark>	 yup
[14:57:04] <mark>	 i think pybal was written for 2.2 or thereabouts
[14:57:24] <volans>	 ehehe
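A minimal sketch of the two counting idioms debated above, plus the dict-as-set workaround mark describes. The Server class, its up/enabled attributes, and the sample data are hypothetical stand-ins for illustration; this is not pybal's actual code, and the exact expression used in the reviewed change is not shown in this log.

```python
class Server:
    """Hypothetical stand-in for a pooled server; not pybal's real class."""

    def __init__(self, host, up, enabled):
        self.host = host
        self.up = up
        self.enabled = enabled


servers = [
    Server("a", True, True),
    Server("b", False, True),
    Server("c", True, False),
    Server("d", True, True),
]

# Idiom 1: build a throwaway list, then take its length.
count_len = len([s for s in servers if s.up and s.enabled])

# Idiom 2: sum a generator of 1s; no intermediate list is created.
count_sum = sum(1 for s in servers if s.up and s.enabled)

# Workaround for very old Python without sets: dict keys act as the set.
hosts_dict = {}
for s in servers:
    if s.up:
        hosts_dict[s.host] = True

# Modern equivalent: a set comprehension.
hosts_set = {s.host for s in servers if s.up}
```

Both counting idioms give the same number; the difference is readability (mark's point that sum suggests adding integers) versus avoiding a temporary list (volans' point).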
[14:57:25] <wikibugs>	 (03CR) 10Ema: [C: 032] Use up-and-enabled servers in can-depool logic [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404680 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[14:57:52] <moritzm>	 !log resetting RAC on labsdb1007 (serial console inaccessible)
[14:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:19] <wikibugs>	 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3906167 (10fgiunchedi) So I tried to investigate this on restbase1013 which reliably failed to reboot cleanly after d-i finished:  ``` Loading Linux 4.9.0-0.bpo.5-amd64 ... Loading initial ramdisk ......
[15:04:38] <wikibugs>	 (03PS7) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126)
[15:05:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata)
[15:06:07] <wikibugs>	 (03PS8) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126)
[15:06:33] <wikibugs>	 (03CR) 10Elukey: [C: 032] Allow to explicitly set the JAVA_HOME environment variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/403701 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey)
[15:07:00] <wikibugs>	 (03PS9) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126)
[15:08:33] <wikibugs>	 (03PS1) 10Elukey: Update the cdh module to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/404685 (https://phabricator.wikimedia.org/T166248)
[15:16:30] <wikibugs>	 (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9761/" [puppet] - 10https://gerrit.wikimedia.org/r/404685 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey)
[15:21:43] <wikibugs>	 (03CR) 10Eevans: [C: 031] "I always struggle getting the first bootstrap cleanly started when Puppet cannot run to completion (which it cannot while units are masked" [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[15:23:39] <wikibugs>	 (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100)
[15:25:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] restbase: reprovision restbase101[35] [puppet] - 10https://gerrit.wikimedia.org/r/404675 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[15:25:54] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688
[15:27:18] <wikibugs>	 (03CR) 10Eevans: [C: 031] restbase: reprovision restbase201[012] [puppet] - 10https://gerrit.wikimedia.org/r/404652 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi)
[15:28:36] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 10Marostegui)
[15:29:29] <wikibugs>	 (03PS1) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444)
[15:30:13] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 10Marostegui)
[15:30:23] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404688 (owner: 10Marostegui)
[15:32:42] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1104 (duration: 01m 12s)
[15:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:41] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690
[15:34:55] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690
[15:36:17] <moritzm>	 !log upgrading nginx on mw servers in codfw to 1.13.6-2+wmf1~jessie1 
[15:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:34] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui)
[15:38:55] <icinga-wm>	 PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:39:39] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui)
[15:39:49] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065, pool db1067 for vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404690 (owner: 10Marostegui)
[15:41:10] <_joe_>	 !log dropping ruwiki htmlCacheUpdate records stuck in the old jobqueue
[15:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:25] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 after fixing data drifts - T162807 (duration: 01m 12s)
[15:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:36] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[15:45:59] <wikibugs>	 (03PS1) 10Arlolra: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691
[15:48:02] <wikibugs>	 (03PS1) 10Marostegui: db1063.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404692 (https://phabricator.wikimedia.org/T184397)
[15:48:42] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3906289 (10Andrew)
[15:48:45] <wikibugs>	 10Operations, 10Cloud-VPS, 10monitoring, 10cloud-services-team (Kanban): remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3906287 (10Andrew) 05Open>03Resolved yep, the project is gone.
[15:49:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db1063.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/404692 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui)
[15:50:19] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3906294 (10chasemp)
[15:51:41] <chasemp>	 !log labstore1002:~# /sbin/reboot
[15:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:22] <wikibugs>	 (03PS1) 10Ema: 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/404694 (https://phabricator.wikimedia.org/T184715)
[15:53:13] <wikibugs>	 (03CR) 10Ema: [C: 032] 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/404694 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[15:53:21] <wikibugs>	 (03PS1) 10Ema: 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404695 (https://phabricator.wikimedia.org/T184715)
[15:54:15] <wikibugs>	 (03CR) 10Ema: [C: 032] 1.14.3: canDepool and alert instrumentation bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/404695 (https://phabricator.wikimedia.org/T184715) (owner: 10Ema)
[15:54:59] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906316 (10chasemp)
[15:55:33] <wikibugs>	 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3897875 (10chasemp) thanks @Marostegui
[15:55:37] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#3906319 (10Bawolff) Thankyou
[15:57:31] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906331 (10chasemp)
[15:58:13] <AaronSchulz>	 godog: is there any reason that the mcrouter package doesn't add an init.d entry or was that just not finished?
[15:59:26] <wikibugs>	 (03PS1) 10Ottomata: Parameterize varnishkafka certificate name for easier setup in Cloud VPS. [puppet] - 10https://gerrit.wikimedia.org/r/404698 (https://phabricator.wikimedia.org/T121561)
[15:59:44] <godog>	 AaronSchulz: no idea specifically, but I'm not surprised since we shouldn't be shipping init.d scripts but systemd service files instead
[16:00:33] <ema>	 !log pybal 1.14.3 uploaded to apt.w.o
[16:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:15] <AaronSchulz>	 godog: it doesn't appear to do that either
[16:04:18] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906361 (10chasemp)
[16:04:21] <wikibugs>	 10Operations, 10Cloud-VPS, 10cloud-services-team: Reboot non-labvirt cloud provider hardware for meltdown - https://phabricator.wikimedia.org/T184730#3906362 (10chasemp)
[16:04:26] <wikibugs>	 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3906358 (10chasemp) 05Open>03Resolved a:03chasemp Full working etherpad is archived at https://wikitech.wikimedia....
[16:05:33] <wikibugs>	 (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9763/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/404698 (https://phabricator.wikimedia.org/T121561) (owner: 10Ottomata)
[16:06:09] <moritzm>	 !log upgrading nginx on mwdebug servers to 1.13.6-2+wmf1~jessie1 
[16:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:03] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906365 (10chasemp)
[16:07:08] <moritzm>	 !log upgrading HHVM in codfw to 3.18.7 (wmf4)
[16:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:19] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp)
[16:09:02] <godog>	 AaronSchulz: ok, can you point me to the repo?
[16:09:30] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 031] "does this need a rebase?" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:09:47] <AaronSchulz>	 godog: I assume it is https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/mcrouter
[16:09:53] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906372 (10chasemp)
[16:09:54] <wikibugs>	 (03PS1) 10Ottomata: Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701
[16:09:55] <XioNoX>	 !log routing ns0 to codfw (baham)
[16:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:43] <wikibugs>	 (03PS2) 10Ottomata: Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701
[16:10:52] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Blacklist gwtoolsetUploadMetadataJob from Hive json refine job [puppet] - 10https://gerrit.wikimedia.org/r/404701 (owner: 10Ottomata)
[16:12:55] <chasemp>	 !log labmon1001:~# /sbin/reboot
[16:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:11] <godog>	 AaronSchulz: thanks, I asked because I didn't work on it, _joe_ did
[16:14:36] <wikibugs>	 (03CR) 10Arlolra: "no" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:15:47] <moritzm>	 AaronSchulz: there's a debian/mcrouter.service in that repo, though?
[16:16:26] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906379 (10chasemp)
[16:16:42] <moritzm>	 AaronSchulz: the debian/rules file uses dh_systemd_enable --no-enable, so you need to manually enable it after installation
[16:17:06] <icinga-wm>	 RECOVERY - Check systemd state on labmon1001 is OK: OK - running: The system is fully operational
[16:17:34] <wikibugs>	 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#3906383 (10fgiunchedi) Booting both restbase1013 and restbase1015 without `quiet` it looks like a race condition: on 1015 assembly worked as intended but on 1013 it failed in the usual way we've experience...
[16:17:39] <ema>	 !log reboot radon (eqiad authdns) for kernel upgrade
[16:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:50] <chasemp>	 !log labmon1001:~# service grafana-server
[16:18:01] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp)
[16:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:57] <godog>	 moritzm: thanks for taking a look!
[16:19:29] <wikibugs>	 (03PS1) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704
[16:21:02] <AaronSchulz>	 godog: I didn't see it in systemctl status, shouldn't it be there?
[16:21:09] <ema>	 XioNoX: radon is back online and serving queries, please revert the routing change!
[16:21:37] <XioNoX>	 cool
[16:21:50] <wikibugs>	 (03PS2) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704
[16:22:00] <XioNoX>	 ema: reverted
[16:22:20] <ema>	 XioNoX: looks good
[16:22:20] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906403 (10chasemp)
[16:22:33] <AaronSchulz>	 godog: nvm
[16:22:38] <AaronSchulz>	 I was missing -a ;)
[16:22:48] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) a:03Andrew
[16:23:03] <wikibugs>	 (03PS1) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284)
[16:23:11] <wikibugs>	 (03CR) 10Gehel: "This patch only configures beta (production changes have been extracted to another patch). We are good to go for this one." [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[16:23:27] <wikibugs>	 (03CR) 10Gehel: [C: 031] Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[16:23:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans)
[16:23:50] <_joe_>	 urandom: do you want me to take a peek?
[16:24:04] <XioNoX>	 ema: ready for ns1
[16:24:08] <wikibugs>	 (03PS3) 10Mark Bergsma: Add unit test cases for Server [debs/pybal] - 10https://gerrit.wikimedia.org/r/404704
[16:24:30] <ema>	 XioNoX: ns1 to radon for baham reboot, sounds good!
[16:24:43] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906413 (10chasemp)
[16:24:59] <XioNoX>	 !log routing ns1 to eqiad
[16:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:41] <wikibugs>	 (03PS2) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284)
[16:25:46] <urandom>	 _joe_: sure
[16:25:49] <XioNoX>	 ema: done
[16:26:01] <urandom>	 _joe_: if you promise not to think less of me as a person
[16:26:11] <urandom>	 _joe_: godog made me do it!
[16:26:14] <_joe_>	 urandom: ahahah :(
[16:26:15] <wikibugs>	 (03CR) 10MarkTraceur: [C: 031] "Looks good, ready to deploy from my POV" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403680 (https://phabricator.wikimedia.org/T184728) (owner: 10Matthias Mullie)
[16:26:16] <ema>	 !log reboot baham (codfw authdns) for kernel upgrade
[16:26:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:31] * urandom throws godog under the bus
[16:26:32] <wikibugs>	 (03PS10) 10Gehel: Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[16:27:00] <_joe_>	 EWWWWW
[16:27:04] <urandom>	 ya
[16:27:06] <_joe_>	 :P
[16:27:15] <wikibugs>	 (03CR) 10Smalyshev: [C: 031] wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel)
[16:27:42] <godog>	 hahaha yeah mkdir -p equivalent still isn't a thing in puppet is it?
[16:28:01] <wikibugs>	 (03CR) 10Gehel: [C: 032] Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones)
[16:28:03] <wikibugs>	 (03PS1) 10Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561)
[16:29:19] <wikibugs>	 (03PS2) 10Ottomata: Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561)
[16:29:31] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Update secrets/certificates with deployment-prep certs for TLS Kafka [labs/private] - 10https://gerrit.wikimedia.org/r/404706 (https://phabricator.wikimedia.org/T121561) (owner: 10Ottomata)
[16:30:41] <ema>	 XioNoX: baham back online and serving queries
[16:30:54] <XioNoX>	 ema: rolling back routing changes
[16:31:02] <ema>	 yes please
[16:31:17] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906430 (10greg) Asking for help from @aaron (kind of object cache) and #operations / #performance-team to diagnose this. Ori was previously the be...
[16:31:29] <XioNoX>	 done
[16:31:40] <ema>	 last one is ns2 -> radon for eeden reboot
[16:31:49] <XioNoX>	 yup
[16:33:50] <XioNoX>	 !log routing ns2 to radon
[16:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:46] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906441 (10chasemp)
[16:36:12] <ema>	 XioNoX: looks like it's done, ok to reboot?
[16:36:27] <wikibugs>	 (03PS1) 10Mark Bergsma: Separate out coordinator.Server into its own module [debs/pybal] - 10https://gerrit.wikimedia.org/r/404713
[16:36:29] <XioNoX>	 ema: nop, something is wrong with routing it seems
[16:36:45] <ema>	 XioNoX: ok, I don't see ns2 queries coming into eeden
[16:36:49] <ema>	 so that part works
[16:37:10] <ema>	 I also don't see them on radon, so that part does not work
[16:37:24] <icinga-wm>	 PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:57] <XioNoX>	 I gathered some data and rolled back
[16:38:06] <ema>	 XioNoX: ok I see ns2 queries back to eeden 
[16:38:43] <icinga-wm>	 RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.94 ms
[16:39:13] <icinga-wm>	 PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 21 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[16:39:20] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906454 (10Tgr)
[16:39:30] <XioNoX>	 so for some reason the router doesn't want to accept the route to 208.80.154.93
[16:40:06] <ema>	 some kind of filter perhaps?
[16:40:15] <mark>	 no arp?
[16:41:23] <icinga-wm>	 PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm-dbg],Package[hhvm]
[16:42:37] <XioNoX>	 maybe because of the no-resolve keyword
[16:45:32] <wikibugs>	 (03CR) 10Gehel: [C: 031] "The related patch for beta has been merged (and does not break anything - there isn't a crh-wiki on beta to test more). We can probably me" [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: 10Gehel)
[16:47:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[16:48:04] <XioNoX>	 I'll investigate it, probably by adding a new VIP from the esams range on radon, and adding matching statics, that way no risk for production traffic
[16:49:12] <ema>	 XioNoX: sounds good to me, let's postpone eeden reboot then
[16:49:13] <icinga-wm>	 RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[16:49:16] <wikibugs>	 (03CR) 10Dzahn: "gotcha Jaime, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: 10Dzahn)
[16:49:40] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906497 (10greg) p:05Triage>03High This is almost UBN! per "This is causing siteinfo API requests (probably all API requests) to fail, which is...
[16:49:41] <ema>	 moritzm: see above, radon and baham rebooted, eeden TODO
[16:49:51] <wikibugs>	 (03PS2) 10Dzahn: testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:50:10] <wikibugs>	 (03PS3) 10Dzahn: testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:50:45] <wikibugs>	 (03CR) 10Dzahn: "it does, but right before merge in any case, so doing more before that not needed" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:51:08] <wikibugs>	 (03CR) 10Dzahn: [C: 032] testreduce: Fix typo in parsoid-rt.config.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:51:23] <icinga-wm>	 RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[16:52:37] <wikibugs>	 (03CR) 10Dzahn: "deployed on ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/404691 (owner: 10Arlolra)
[16:52:43] <ema>	 !log upgrade secondary LVSs to pybal 1.14.3 T184715, T184721
[16:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:55] <stashbot>	 T184721: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721
[16:52:56] <stashbot>	 T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[16:53:24] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906510 (10greg)
[16:53:42] <mutante>	 @seen Ladsgroup
[16:53:43] <wm-bot>	 mutante: I have never seen Ladsgroup
[16:53:51] <paladox>	 mutante it's Amir1 :)
[16:54:00] <mutante>	 :) thanks!
[16:54:07] <paladox>	 you're welcome :).
[16:54:27] <wikibugs>	 (03PS3) 10Eevans: [WIP] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284)
[16:54:32] <mutante>	 Amir1: you were mentioned for this because you did work on standardizing error pages :) https://gerrit.wikimedia.org/r/#/c/395552/
[16:57:26] <Amir1>	 mutante: hey, I just got back
[16:57:29] <Amir1>	 let me check
[16:58:06] <Amir1>	 mutante: Can I run a quick test and let you know?
[16:58:26] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721
[16:59:28] <wikibugs>	 (03PS3) 10Gehel: wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315
[16:59:53] <moritzm>	 ema: great
[17:00:15] <mutante>	 Amir1: of course, you can run tests for weeks ;)
[17:00:28] <mutante>	 welcome back
[17:00:34] <wikibugs>	 (03CR) 10Eevans: "[PC output](http://puppet-compiler.wmflabs.org/9765/)" [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans)
[17:00:59] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel)
[17:03:04] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721
[17:04:13] <icinga-wm>	 RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms
[17:04:30] <Amir1>	 mutante: tests have finished and it looks good IMO
[17:04:55] <Amir1>	 mutante: tell me when you merged it so I approve the GCI task
[17:05:04] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:05:13] <icinga-wm>	 PROBLEM - MegaRAID on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:06:10] <ema>	 !log upgrade pybal on primary LVSs to 1.14.3 T184715, T184721
[17:06:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:22] <stashbot>	 T184721: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721
[17:06:23] <stashbot>	 T184715: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715
[17:07:03] <icinga-wm>	 PROBLEM - SSH on labstore2003 is CRITICAL: connect to address 10.192.21.6 and port 22: Connection refused
[17:07:03] <icinga-wm>	 PROBLEM - Disk space on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:07:13] <icinga-wm>	 PROBLEM - Check systemd state on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:07:14] <icinga-wm>	 PROBLEM - DPKG on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:07:14] <icinga-wm>	 PROBLEM - configured eth on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:07:14] <icinga-wm>	 PROBLEM - dhclient process on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:08:22] <godog>	 !log bootstrap cassandra-a on restbase1013
[17:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:34] <icinga-wm>	 PROBLEM - puppet last run on labstore2003 is CRITICAL: Return code of 255 is out of bounds
[17:08:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Install stretch on es200[1234] reinstall [puppet] - 10https://gerrit.wikimedia.org/r/404721 (owner: 10Jcrespo)
[17:11:40] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906569 (10greg)
[17:12:12] <madhuvishy>	 !log Rebooting labstore2004
[17:12:20] <wikibugs>	 (03PS4) 10Eevans: cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284)
[17:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:50] <wikibugs>	 (03CR) 10Chad: [C: 031] "Fine by me, anything is an improvement over the current page :D" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:16:30] <wikibugs>	 (03PS5) 10Thcipriani: Scap canary: cache last good deploy time [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999)
[17:17:07] <chasemp>	 !log reboot labstore2003
[17:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:39] <wikibugs>	 (03CR) 10Chad: [C: 031] "Actually, minor nit re: file naming inside (not a blocker, but would be nice)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:19:49] <wikibugs>	 (03CR) 10Thcipriani: "Fixups from Volans review." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani)
[17:20:12] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#3906596 (10ema) 05Open>03Resolved a:03ema
[17:20:21] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721#3906598 (10ema) 05Open>03Resolved a:03ema
[17:24:44] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[17:24:44] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:25:16] <volans>	 thcipriani: there should be only one line starting with MEDIAWIKI_STAGING_DIR right?
[17:26:08] <_joe_>	 whoa big big spike in 5xx
[17:26:26] <volans>	 reported in -tech too
[17:26:33] <icinga-wm>	 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[17:26:35] <_joe_>	 still ongoing?
[17:26:36] <_joe_>	 yes
[17:27:13] <_joe_>	 no it seems over
[17:28:13] <icinga-wm>	 RECOVERY - SSH on labstore2003 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[17:28:14] <icinga-wm>	 RECOVERY - Disk space on labstore2003 is OK: DISK OK
[17:28:23] <icinga-wm>	 RECOVERY - Check systemd state on labstore2003 is OK: OK - running: The system is fully operational
[17:28:23] <icinga-wm>	 RECOVERY - configured eth on labstore2003 is OK: OK - interfaces up
[17:28:23] <icinga-wm>	 RECOVERY - DPKG on labstore2003 is OK: All packages OK
[17:28:24] <icinga-wm>	 RECOVERY - dhclient process on labstore2003 is OK: PROCS OK: 0 processes with command name dhclient
[17:31:03] <wikibugs>	 (03CR) 10Kaldari: [C: 04-1] Add a test verifying that rtl.dblist is up to date (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem)
[17:31:37] <thcipriani>	 volans: that's true, updating patch.
[17:31:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: add support for SSLCARevocationCheck setting in puppetmaster frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[17:31:59] <volans>	 thcipriani: if you wait 2 min I'll finish the review ;)
[17:32:13] <icinga-wm>	 PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[17:32:37] <thcipriani>	 volans: heh, k, thanks for the review by the by :)
[17:32:51] <halfak>	 moritzm, just saw your email re. labsdb1004
[17:32:56] <halfak>	 Sorry for the terrible delay. 
[17:33:26] <halfak>	 I'm available for the next 6 hours if today works.
[17:33:44] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:33:49] <halfak>	 Otherwise, I'll respond to the email with some ideas for earlier UTC tomorrow. 
[17:34:33] <icinga-wm>	 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[17:34:44] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[17:35:04] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on labstore2003 is OK: OK: synced at Wed 2018-01-17 17:35:02 UTC.
[17:35:13] <icinga-wm>	 RECOVERY - MegaRAID on labstore2003 is OK: OK: optimal, 1 logical, 2 physical
[17:35:29] <wikibugs>	 (03PS26) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778)
[17:35:42] <wikibugs>	 (03CR) 10Paladox: Update gerrit login display (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:36:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736
[17:37:28] <wikibugs>	 (03CR) 10Volans: "Thanks for the fixes, much nicer! I've added just a couple of smaller comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani)
[17:38:35] <volans>	 thcipriani: done ^^ :)
[17:39:45] <thcipriani>	 thanks!
[17:40:00] <volans>	 yw
[17:40:11] <arturo>	 volans: ^
[17:40:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "Thanks Volans for the review." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[17:41:16] <volans>	 arturo: ack, looking
[17:41:27] * volans hates when gerrit doesn't scroll properly
[17:42:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3906721 (10RobH) 05Open>03Resolved they are in racktables and now being tracked in the spares rack in eqiad.
[17:42:29] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736
[17:43:45] <wikibugs>	 (03PS1) 10Ottomata: [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461)
[17:44:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata)
[17:44:12] <moritzm>	 !log resetting RAC on labsdb1004 (serial console inaccessible)
[17:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:24] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK
[17:45:11] <wikibugs>	 (03PS2) 10Ottomata: [WIP] Produce webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461)
[17:45:55] <wikibugs>	 (03CR) 10Chad: [C: 031] "lgtm, let's do this!" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:46:04] <paladox>	 no_justification thanks :).
[17:46:27] <no_justification>	 Is gerritLogin.js something hardcoded in Gerrit itself? ie: would gerritLogin.cache.js not work?
[17:46:39] <no_justification>	 A minor detail since it only loads on the login page, mostly curious
[17:47:16] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9766/cp1054.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata)
[17:47:24] <paladox>	 no_justification we can use gerritLogin.cache.js if you want
[17:47:29] <paladox>	 we load the js file in the css file
[17:47:37] <wikibugs>	 (03PS27) 10Dzahn: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:47:45] <no_justification>	 Where's that JS file loaded from?
[17:48:04] <wikibugs>	 (03PS28) 10Paladox: Update gerrit login display [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778)
[17:48:11] <paladox>	 no_justification  here https://gerrit.wikimedia.org/r/#/c/402665/28/modules/gerrit/files/etc/GerritSite.css
[17:48:12] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740
[17:48:27] <paladox>	 <script src="/r/static/gerritLogin.cache.js"></script>
[17:49:05] <wikibugs>	 (03PS29) 10Dzahn: gerrit: new fancy login page design [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:49:22] <no_justification>	 Hah!
[17:49:29] <no_justification>	 Really, that's injected as raw HTML?
[17:49:37] <paladox>	 no_justification apparently so
[17:49:47] <no_justification>	 #til
[17:49:57] <no_justification>	 Oh well, it's not urgent. Let's land it as-is, then maybe follow up
[17:50:24] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "there you go" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[17:51:54] <paladox>	 no_justification mutante looks perfect https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen :).
[17:52:03] <mutante>	 https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen
[17:52:28] <mutante>	 thanks, it definitely looks more modern
[17:52:40] <mutante>	 and the part that upstream designer guy was on the change was also nice
[17:52:43] <mutante>	 for licensing
[17:52:44] <paladox>	 :)
[17:52:48] <no_justification>	 There's a bunch of negative space at the top for me, but minor
[17:52:50] <volans>	 arturo: if I'm not mistaken the puppet side of it is missing the dependency on the python3-apt lib
[17:53:10] <paladox>	 no_justification i guess it must be that margin-top: 10%;
[17:53:42] <no_justification>	 https://phabricator.wikimedia.org/F12619936
[17:54:05] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Krinkle) Is this task about session storage Redis or JobQueue Redis? I would assume JobQueue Redis given that ApiSiteInfo is supports ou...
[17:54:16] <arturo>	 volans: I sent this --> https://gerrit.wikimedia.org/r/404740
[17:54:28] <volans>	 ahhh separate one
[17:54:31] <volans>	 missed that
[17:54:45] <wikibugs>	 (03Draft1) 10Paladox: Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741
[17:54:47] <wikibugs>	 (03Draft2) 10Paladox: Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741
[17:54:55] <no_justification>	 paladox: Yeah, removing that rule made it work nicer. Does that affect anything else?
[17:55:15] <no_justification>	 I don't think so, it's targeted to the login page
[17:55:18] <paladox>	 no_justification from my quick testing nope.
[17:55:33] <volans>	 arturo: that's telling puppet the order, but not actually installing the package. You also need to explicitly add the package
[17:55:49] <mutante>	 !log gerrit login page design changed (https://gerrit.wikimedia.org/r/402665) in case you were worried it was a fake page trying to steal your login, heh
[17:55:52] <paladox>	 no_justification it works :). https://gerrit.wikimedia.org/r/404741
[17:55:57] <volans>	 the order and the dependency, it will fail if there is no Package['python3-apt'] resource managed, to be precise
[17:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:10] <paladox>	 tested on https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen
[17:56:26] <arturo>	 oh volans I think I understand now
[17:56:56] <volans>	 we usually use require_pacakge around the code, that is a wrapper that accepts either multiple params, one per package
[17:57:00] <volans>	 or a list of packages
[17:57:10] <wikibugs>	 (03PS1) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878)
[17:57:13] <volans>	 *require_package ofc
[17:57:30] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740
[17:57:32] <wikibugs>	 (03PS2) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878)
[17:57:48] <arturo>	 volans: not that solution then ^^
[17:58:22] <wikibugs>	 (03CR) 10Muehlenhoff: Ensure samtar and samwalton9 are absent after account expiration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata)
[17:58:39] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Gerrit: Remove margin-top: 10% from GerritSite.css [puppet] - 10https://gerrit.wikimedia.org/r/404741 (owner: 10Paladox)
[17:58:44] <paladox>	 thanks :)
[17:59:16] <volans>	 arturo: that works too, the require_package allows you to do a single call for all the packages required in that file and also ensures that they are installed before anything in the same scope is executed
[17:59:26] * James_F waves in advance of jouncebot.
[17:59:33] <volans>	 see modules/wmflib/lib/puppet/parser/functions/require_package.rb for its implementation if you're curious ;)
[17:59:49] <paladox>	 https://gerrit.wikimedia.org/r/login/%23%2Fq%2Fstatus%3Aopen looks much better now :).
[17:59:51] <wikibugs>	 (03PS3) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878)
[17:59:59] <wikibugs>	 10Operations, 10MediaWiki-API, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906780 (10Krinkle)
[18:00:04] <jouncebot>	 addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1800).
[18:00:04] <jouncebot>	 James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata)
[18:00:16] <mutante>	 paladox: confirmed :)
[18:00:19] <paladox>	 :)
[18:00:20] <wikibugs>	 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Krinkle)
[18:00:28] <wikibugs>	 (03CR) 10Volans: "ok, thanks for the reply and the follow up on I4ead8b545e57cd135cee313636c816da194cacfd" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[18:00:28] <Niharika>	 o/ James_F I can SWAT. 
[18:00:50] <James_F>	 Ta.
[18:01:17] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, optional nitpick inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez)
[18:02:06] <mutante>	 bblack: shall we merge the Letsencrypt license change now?
[18:03:01] <elukey>	 as a side note, mod_md has been merged to 2.4.x and it will be part of 2.4.30
[18:03:11] <elukey>	 https://httpd.apache.org/docs/2.4/mod/mod_md.html
[18:04:13] <wikibugs>	 (03PS4) 10Ottomata: Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878)
[18:04:16] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Ensure samtar and samwalton9 are absent after account expiration [puppet] - 10https://gerrit.wikimedia.org/r/404743 (https://phabricator.wikimedia.org/T170878) (owner: 10Ottomata)
[18:06:18] <wikibugs>	 (03CR) 10Dzahn: [C: 031] letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:07:22] <Niharika>	 Zuul seems sleep deprived or something. 
[18:07:33] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, optional change inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez)
[18:07:37] <wikibugs>	 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3906804 (10aaron) So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (try...
[18:08:01] <wikibugs>	 (03PS1) 10Ottomata: Use log_retention params in profile::kafka::broker [puppet] - 10https://gerrit.wikimedia.org/r/404747
[18:09:08] <moritzm>	 !log uploading HHVM 3.18.5+wmf4 for stretch-wikimedia to apt.wikimedia.org (3.18.7 with the patch https://github.com/facebook/hhvm/commit/bd7b2bcfe70b053a3a001480653012f68599250f backed out)
[18:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:57] <moritzm>	 elukey: oh, nice
[18:12:41] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906818 (10madhuvishy)
[18:12:44] <wikibugs>	 10Operations, 10cloud-services-team: labstore2003 reboots into mode missing /srv disks - https://phabricator.wikimedia.org/T185102#3906816 (10madhuvishy) 05Open>03Resolved Drives not being mounted at /srv is the right behavior. The lvms aren't mounted by default because if they were, our bdsync based backu...
[18:16:43] <wikibugs>	 (03PS2) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444)
[18:17:43] <Niharika>	 Alrighty! James_F - You're good to test it on mwdebug1002. 
[18:18:04] <James_F>	 Niharika: Yup, LGTM.
[18:18:20] <Niharika>	 Let's sync it. 
[18:18:27] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9767/" [puppet] - 10https://gerrit.wikimedia.org/r/404747 (owner: 10Ottomata)
[18:19:49] <wikibugs>	 (03CR) 10Herron: add support for SSLCARevocationCheck setting in puppetmaster frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404689 (https://phabricator.wikimedia.org/T184444) (owner: 10Herron)
[18:20:44] <wikibugs>	 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3906844 (10jcrespo)
[18:20:47] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3906843 (10jcrespo)
[18:20:49] <logmsgbot>	 !log niharika29@tin Synchronized php-1.31.0-wmf.17/includes/EditPage.php: Update Save/Publish button flag from 'constructive' to 'progressive' https://gerrit.wikimedia.org/r/#/c/404733/ (duration: 01m 14s)
[18:20:51] <Niharika>	 James_F: Done. ^
[18:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:07] <James_F>	 Niharika: Thank you so much. :-)
[18:21:20] <Niharika>	 You're welcome. 
[18:21:32] <wikibugs>	 (03PS5) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564
[18:21:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[18:22:31] <wikibugs>	 (03CR) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[18:23:46] <wikibugs>	 (03PS6) 10Dzahn: ganeti: create profiles, split monitoring/firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/392564
[18:26:33] <wikibugs>	 (03CR) 10Qgil: [C: 04-1] "This patch would enable one specific feed. We need to discuss whether it is possible to whitelist any RSS feed coming from a domain. Would" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404653 (https://phabricator.wikimedia.org/T185087) (owner: 10Aklapper)
[18:27:45] <bblack>	 mutante: yes please
[18:28:11] <wikibugs>	 (03PS1) 10Rush: icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807)
[18:29:14] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3906886 (10chasemp)
[18:31:03] <chasemp>	 mutante: brief sanity check review https://gerrit.wikimedia.org/r/#/c/404751/ please?
[18:33:21] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:33:39] <wikibugs>	 (03PS2) 10Dzahn: deployment-prep: Commit hiera config for etcd [puppet] - 10https://gerrit.wikimedia.org/r/403205 (owner: 10Chad)
[18:33:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:34:25] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/403205 (owner: 10Chad)
[18:34:41] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:35:08] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:35:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:35:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:37:11] <wikibugs>	 (03CR) 10Dzahn: [C: 031] icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) (owner: 10Rush)
[18:37:18] <mutante>	 chasemp: looks good!
[18:37:21] <mutante>	 bblack: ok :)
[18:37:29] <mutante>	 no_justification: done
[18:37:35] <no_justification>	 ty!
[18:37:46] <wikibugs>	 (03PS2) 10Rush: icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807)
[18:37:48] <chasemp>	 tx mutante 
[18:37:53] <mutante>	 yw
[18:38:21] <wikibugs>	 (03PS3) 10Dzahn: letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:38:25] <wikibugs>	 (03CR) 10Rush: [C: 032] icinga: add aborrero to sms group [puppet] - 10https://gerrit.wikimedia.org/r/404751 (https://phabricator.wikimedia.org/T178807) (owner: 10Rush)
[18:41:37] <wikibugs>	 (03PS4) 10Dzahn: letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:43:00] <wikibugs>	 (03CR) 10Dzahn: [C: 032] letsencrypt: Update LE subscriber agreement URL [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:43:31] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:43:31] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:43:42] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:43:53] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:44:02] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:44:02] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:44:02] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:44:22] <icinga-wm>	 PROBLEM - dhclient process on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:44:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:44:54] <wikibugs>	 (03CR) 10Dzahn: "no issues with puppet runs: cobalt, netmon1002, dbmonitor1001 ..." [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:45:46] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:46:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:46:31] <icinga-wm>	 PROBLEM - SSH on pybal-test2001 is CRITICAL: Server answer
[18:47:12] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[18:48:59] <mutante>	 Krenair: i merged the LE change, no problems in prod. i then went to deployment-mx02 to confirm, i guess i need to wait for the puppetmaster to sync
[18:50:27] <wikibugs>	 (03PS6) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:50:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:51:15] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:51:21] <wikibugs>	 (03CR) 10Dzahn: "so far the error has not changed on deployment-mx02  but i think it just needs to sync the puppetmaster with prod.. should work in a bit.." [puppet] - 10https://gerrit.wikimedia.org/r/403326 (owner: 10Alex Monk)
[18:51:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:52:42] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[18:53:02] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[18:53:02] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[18:53:02] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[18:53:31] <icinga-wm>	 RECOVERY - dhclient process on pybal-test2001 is OK: PROCS OK: 0 processes with command name dhclient
[18:53:31] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[18:53:31] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[18:53:32] <icinga-wm>	 RECOVERY - SSH on pybal-test2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[18:53:53] <wikibugs>	 (03PS8) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:54:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[18:55:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez)
[18:55:49] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736
[18:56:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: improvements for apt-upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/404736 (owner: 10Arturo Borrero Gonzalez)
[18:57:11] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures
[18:57:56] <wikibugs>	 (03PS9) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[18:59:06] <wikibugs>	 (03PS10) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[19:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T1900)
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:08] <wikibugs>	 (03PS11) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[19:00:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez)
[19:00:31] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740
[19:00:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] apt: unattended-upgrades: ensure python3-apt is installed [puppet] - 10https://gerrit.wikimedia.org/r/404740 (owner: 10Arturo Borrero Gonzalez)
[19:01:12] <wikibugs>	 (03PS12) 10Jcrespo: mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754
[19:02:17] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Add small tunings in preparation for es200[1234] reimage [puppet] - 10https://gerrit.wikimedia.org/r/404754 (owner: 10Jcrespo)
[19:02:22] <MatmaRex>	 beta seems down, is this known? https://en.wikipedia.beta.wmflabs.org/ gives a 503
[19:02:31] <paladox>	 MatmaRex hi i think so
[19:02:38] <jynus>	 MatmaRex: I think greg sent an email recently
[19:02:42] <paladox>	 MatmaRex https://phabricator.wikimedia.org/T185055
[19:02:49] <MatmaRex>	 okay. thanks
[19:02:57] <jynus>	 I cannot say if it's related, but it's worth checking
[19:03:28] <MatmaRex>	 hmm, that task seems narrower. it only talks about the API, but the whole site is down
[19:04:43] <tgr>	 MatmaRex: yeah, seems to be a different issue
[19:06:16] <paladox>	 tgr according to logstash-beta this is most likely redis. Unless redis logback is hiding the true problem.
[19:06:39] <tgr>	 ...or not, the error just looks different because this has Varnish in the middle and the API doesn't
[19:06:47] <wikibugs>	 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3904733 (10Paladox) On logstash-beta i see  wiki:enwiki exception.trace:#0 [internal function]: MWExceptionHandler::handleErr...
[19:08:18] <wikibugs>	 10Operations, 10MediaWiki-API, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3907077 (10Tgr) All of Beta MediaWiki seems to be down now. ``` tgr@deployment-mediawiki04:~$ curl -v -H 'Host: en.wikipedia....
[19:08:32] <wikibugs>	 (03PS1) 10Jcrespo: mariadb-temporary_storage: Do not monitor mariadb [puppet] - 10https://gerrit.wikimedia.org/r/404761
[19:09:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb-temporary_storage: Do not monitor mariadb [puppet] - 10https://gerrit.wikimedia.org/r/404761 (owner: 10Jcrespo)
[19:12:53] <wikibugs>	 10Operations, 10MediaWiki-JobQueue, 10Performance-Team, 10Beta-Cluster-reproducible: Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055#3907085 (10Anomie) This has nothing to do with #mediawiki-api, or as far as I can tell anything to do with code in MediaWiki at all. It seems to...
[19:13:45] <wikibugs>	 (03PS1) 10Mark Bergsma: Expand test coverage of server.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/404762
[19:16:12] <wikibugs>	 (03Restored) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: 10Zoranzoki21)
[19:16:18] <wikibugs>	 (03PS12) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151)
[19:17:18] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Enable notifications on es2001-4 and set default behaviour [puppet] - 10https://gerrit.wikimedia.org/r/404768
[19:20:50] <wikibugs>	 (03CR) 10Ladsgroup: [C: 031] Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[19:21:37] <wikibugs>	 (03PS6) 10Zoranzoki21: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[19:22:31] <wikibugs>	 (03PS2) 10MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337)
[19:22:35] <wikibugs>	 (03CR) 10MaxSem: Add a test verifying that rtl.dblist is up to date (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem)
[19:23:52] <wikibugs>	 (03PS3) 10MaxSem: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337)
[19:24:01] <wikibugs>	 (03PS6) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297)
[19:24:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata)
[19:24:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Not until they have been reimaged." [puppet] - 10https://gerrit.wikimedia.org/r/404768 (owner: 10Jcrespo)
[19:28:30] <wikibugs>	 (03PS7) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297)
[19:39:00] <wikibugs>	 (03PS1) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297)
[19:39:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata)
[19:39:31] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3907130 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete.
[19:41:27] <wikibugs>	 (03PS2) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297)
[19:41:39] <wikibugs>	 (03CR) 10Hashar: "[5/5] I will logout/login more often!" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[19:41:54] <paladox>	 hashar thanks :).
[19:42:30] <wikibugs>	 (03PS7) 10Dzahn: mediawiki: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[19:43:58] <wikibugs>	 (03CR) 10Dzahn: [C: 032] mediawiki: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[19:44:20] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9769/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata)
[19:44:31] <wikibugs>	 (03CR) 10Hashar: "Can you possibly reuse the commit message from the original change https://gerrit.wikimedia.org/r/#/c/398484/ ?  And maybe explain why it " [puppet] - 10https://gerrit.wikimedia.org/r/404480 (owner: 10Herron)
[19:44:44] <wikibugs>	 (03PS3) 10Ottomata: [WIP] point eventlogging processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297)
[19:45:17] <papaul>	 !log Powering down mw2140 for main board replacement 
[19:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:38] <wikibugs>	 (03CR) 10Dzahn: "thanks paladox and markusguenther for making the design and releasing it !:)" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox)
[19:46:31] <andrewbogott>	 !log rebooting labpuppetmaster1002
[19:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:50] <wikibugs>	 (03Draft1) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778)
[19:46:52] <wikibugs>	 (03Draft2) 10Paladox: Gerrit: Add attribution to background image [puppet] - 10https://gerrit.wikimedia.org/r/404777 (https://phabricator.wikimedia.org/T184778)
[19:48:34] <wikibugs>	 (03CR) 10Dzahn: "is the compiler error my fault or just a limitation because i rename the hiera key in the same change? http://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn)
[19:49:25] <wikibugs>	 (03CR) 10Dzahn: "commit message has been updated since that last reviewer comment" [puppet] - 10https://gerrit.wikimedia.org/r/382930 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn)
[19:50:10] <wikibugs>	 (03CR) 10Dzahn: "it compiles http://puppet-compiler.wmflabs.org/9711/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/403730 (owner: 10Dzahn)
[19:52:14] <icinga-wm>	 PROBLEM - Host mw2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:53:11] <wikibugs>	 (03PS8) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297)
[19:53:26] <wikibugs>	 (03PS9) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297)
[19:53:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata)
[19:54:57] <wikibugs>	 (03PS10) 10Ottomata: [WIP] Refactor cache::kafka::eventlogging into profile and enable TLS [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297)
[19:56:01] <paladox>	 Krinkle https://gerrit.wikimedia.org/r/404777 
[19:56:03] <andrewbogott>	 !log rebooting labpuppetmaster1001
[19:56:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:41] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3907183 (10Andrew)
[19:57:25] <wikibugs>	 (03PS3) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726
[19:59:23] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "i dont understand why this gets  unrelated  "Error: Could not find resource 'Exec[apt-get update]' for relationship from 'Class[Profile::C" [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn)
[20:00:04] <jouncebot>	 thcipriani: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T2000).
[20:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[20:01:52] * thcipriani works on it
[20:03:05] <icinga-wm>	 PROBLEM - Host labcontrol1002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:04:01] <wikibugs>	 (03CR) 10Dzahn: "i'll split it into multiple patches, i assume you prefer that, right" [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn)
[20:04:05] <wikibugs>	 (03Abandoned) 10Dzahn: wmcs/labs: move more firewall/standard includes into roles [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn)
[20:05:01] <andrewbogott>	 !log rebooted labservices1002, labcontrol1002, labnet1002
[20:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:45] <icinga-wm>	 PROBLEM - Host labservices1002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:07:24] <icinga-wm>	 RECOVERY - Host labservices1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:08:43] <wikibugs>	 (03CR) 10Ottomata: [C: 031] "Hm, don't know either, but +1 ! :)" [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn)
[20:09:38] <wikibugs>	 (03PS8) 10Tjones: Updates to enable short URLs for transliteration for crhwiki production [puppet] - 10https://gerrit.wikimedia.org/r/398832 (https://phabricator.wikimedia.org/T23582) (owner: 10Gehel)
[20:09:56] <wikibugs>	 (03PS8) 10Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582)
[20:14:07] <wikibugs>	 (03PS1) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461)
[20:14:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata)
[20:16:04] <icinga-wm>	 RECOVERY - Host labcontrol1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[20:16:36] <wikibugs>	 (03PS1) 10Dzahn: labtest: move firewall/standard includes to roles [puppet] - 10https://gerrit.wikimedia.org/r/404790
[20:16:43] <wikibugs>	 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3907234 (10Andrew)
[20:17:11] <wikibugs>	 (03PS2) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461)
[20:17:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata)
[20:18:33] <wikibugs>	 (03PS3) 10Ottomata: No-op for refinery job camus to ease future analytics -> jumbo kafka [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461)
[20:21:06] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9772/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404789 (https://phabricator.wikimedia.org/T175461) (owner: 10Ottomata)
[20:21:10] <wikibugs>	 (03PS1) 10Dzahn: site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791
[20:22:41] <wikibugs>	 (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/404790/" [puppet] - 10https://gerrit.wikimedia.org/r/399542 (owner: 10Dzahn)
[20:23:04] <wikibugs>	 (03PS4) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726
[20:25:20] <wikibugs>	 (03PS5) 10Dzahn: druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726
[20:25:34] <icinga-wm>	 PROBLEM - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100%
[20:28:56] <wikibugs>	 (03CR) 10Dzahn: [C: 032] druid: move firewall includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/397726 (owner: 10Dzahn)
[20:30:10] <logmsgbot>	 !log pnorman@tin Started deploy [kartotherian/deploy@ecdda41]: (no justification provided)
[20:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:33] <wikibugs>	 (03CR) 10Kaldari: [C: 032] Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem)
[20:34:55] <icinga-wm>	 RECOVERY - Host mw2140.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 37.70 ms
[20:35:54] <logmsgbot>	 !log pnorman@tin Finished deploy [kartotherian/deploy@ecdda41]: (no justification provided) (duration: 05m 44s)
[20:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem)
[20:37:02] <wikibugs>	 (03CR) 10jenkins-bot: Add a test verifying that rtl.dblist is up to date [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404616 (https://phabricator.wikimedia.org/T172337) (owner: 10MaxSem)
[20:37:44] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2861029 (10Isarra) So when's this happening? Wheeeeeen?
[20:40:15] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.17/includes/Storage/RevisionStore.php: [[gerrit:404757|[MCR] RevisionStore::getTitle final logged fallback to master]] PART I (duration: 01m 04s)
[20:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:36] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907310 (10dr0ptp4kt) Hi @Isarra , just wanted to note that @Deskana is taking on product owner duties on this and is working with @Tgr a...
[20:41:35] <icinga-wm>	 PROBLEM - Host mw2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:41:54] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.17/includes/ServiceWiring.php: [[gerrit:404757|[MCR] RevisionStore::getTitle final logged fallback to master]] PART II (duration: 01m 12s)
[20:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:26] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.31.0-wmf.17/vendor/wikibase/data-model-services: [[gerrit:404758|Add missing files from wikibase/data-model-services 3.9.0]] (duration: 01m 15s)
[20:45:26] <wikibugs>	 (03CR) 10Rush: [C: 031] site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791 (owner: 10Dzahn)
[20:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:44] <icinga-wm>	 RECOVERY - Host mw2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms
[20:48:46] <wikibugs>	 (03CR) 10Phantom42: "Thank you for merging this!" [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[20:49:59] <wikibugs>	 (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801
[20:52:18] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani)
[20:52:42] <wikibugs>	 (03PS2) 10Dzahn: site: use role(test) for unused labs nodes [puppet] - 10https://gerrit.wikimedia.org/r/404791
[20:53:04] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/404791 (owner: 10Dzahn)
[20:53:22] <wikibugs>	 (03CR) 10Dzahn: "thanks for the fix :)" [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42)
[20:55:08] <wikibugs>	 (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani)
[20:56:44] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3907372 (10Ottomata)
[20:56:54] <wikibugs>	 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3840663 (10Ottomata) a:03Ottomata
[20:57:11] <wikibugs>	 (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404801 (owner: 10Thcipriani)
[20:57:38] <logmsgbot>	 !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.17
[20:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:44] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907388 (10Raymond) >>! In T133410#3907310, @dr0ptp4kt wrote: > Hi @Isarra , just wanted to note that @Deskana is taking on product owner...
[21:00:04] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180117T2100).
[21:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[21:00:19] <awight>	 Nothing for ORES today
[21:03:25] <wikibugs>	 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3907402 (10Isarra) If we want to propose specific projects for this, should we just do the usual discussion on-wiki to see if there's con...
[21:03:41] <logmsgbot>	 !log thcipriani@tin Synchronized php: group1 to 1.31.0-wmf.17 (duration: 01m 11s)
[21:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:15] <wikibugs>	 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3907403 (10Papaul) a:05Papaul>03MoritzAccountTest Main board replacement complete - Test ssh connection ( racadm power commands)  - clear log  - Update IDRAC firmware from version 2.21 to version...
[21:11:49] <wikibugs>	 (03PS2) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:12:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush)
[21:12:59] <wikibugs>	 (03PS3) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:13:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush)
[21:15:04] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[21:16:04] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time
[21:17:17] <wikibugs>	 (03PS4) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:18:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush)
[21:18:09] <wikibugs>	 (03PS5) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:18:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush)
[21:19:00] <wikibugs>	 (03PS6) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:20:01] <wikibugs>	 (03PS7) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202
[21:27:44] <wikibugs>	 (03PS1) 10Jdlrobson: Use the correct Pashto Wikipedia wordmark on mobile site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404828 (https://phabricator.wikimedia.org/T184442)
[21:37:32] <wikibugs>	 (03CR) 10Framawiki: "This patch was created to create a redirect from techblog.wikimedia.org to blog.wikimedia.org/c/technology, instead of the main page of th" [puppet] - 10https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: 10Framawiki)
[21:44:15] <icinga-wm>	 PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 5 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Service[nagios-nrpe-server],Exec[ip addr add 2620:0:860:102:10:192:16:139/64 dev eth0],Exec[absent_ensure_members]
[21:44:34] <icinga-wm>	 PROBLEM - Keyholder SSH agent on labpuppetmaster1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[21:45:24] <icinga-wm>	 PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[21:50:34] <icinga-wm>	 RECOVERY - Keyholder SSH agent on labpuppetmaster1001 is OK: OK: Keyholder is armed with all configured keys.
[21:54:53] <chasemp>	 herron: about? quick q
[21:55:09] <herron>	 chasemp: hey, what’s up?
[21:55:22] <chasemp>	 hey did we build our own trusty puppet packages in the end?
[21:55:24] <chasemp>	 for 4.x I mean
[21:55:30] <wikibugs>	 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3907591 (10RobH) a:05MoritzAccountTest>03MoritzMuehlenhoff
[21:57:05] <herron>	 chasemp: yeah, ended up backporting the debian packages to trusty.  there is some background about it in T182894
[21:57:06] <stashbot>	 T182894: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894
[21:58:48] <chasemp>	 herron: so I may be one of the few that will use util/localrun to test on an instance but it seems to have broken and it looks like 'Error while evaluating a Function Call, uninitialized constant RGen::ECore::ELong' on Trusty, and I think it may be that ruby-rgen is not only no longer a dependency but conflicts for puppet 4.x or so https://bugzilla.redhat.com/show_bug.cgi?id=1411809
[22:00:01] <chasemp>	 but maybe that's not right if https://packages.debian.org/stretch/puppet is to be believed
[22:00:08] <herron>	 hmm
[22:00:23] <andrewbogott>	 !log rebooting californium, silver, labcontrol1001, labservices1001
[22:01:55] <icinga-wm>	 PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused
[22:02:34] <icinga-wm>	 PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[22:04:37] <herron>	 chasemp yes that packages.debian.org link is right, ruby-rgen is a dependency of the trusty puppet package (and just double checked on a trusty host)
[22:05:06] <herron>	 and is installed
[22:05:45] <chasemp>	 herron: yeah, I think it's possible from puppetlabs perspective ruby-rgen is no longer a dep and in practical terms may conflict
[22:06:15] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[22:06:34] <icinga-wm>	 RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[22:06:55] <icinga-wm>	 RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.002 second response time
[22:07:08] <chasemp>	 herron: I'm grasping at straws atm here :) wanted to get your perspective
[22:07:15] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active
[22:09:15] <icinga-wm>	 RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:19:09] <wikibugs>	 Operations, ops-esams, netops: replace msw1-esams - https://phabricator.wikimedia.org/T185151#3907680 (ayounsi)
[22:26:37] <wikibugs>	 Operations, Mail: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3907759 (herron) Spent some time looking into this today.  Sounds good overall.  I'd like to roll these updates out in a phased way, and will need to split wikipedia.com into its own dns zone file as it's cur...
[22:34:43] <wikibugs>	 Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2861046 (Iniquity) @Isarra we are waiting for T180817 this task.
[22:38:44] <matthiasmullie>	 hey - I'd like to scap deploy /srv/deployment/3d2png/deploy - can I do that now, or should I schedule it?
[22:39:56] <Krenair>	 greg-g, ^
[22:41:01] <greg-g>	 matthiasmullie: there are services windows throughout the week
[22:41:31] <wikibugs>	 Operations, ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907805 (RobH) p:Triage>High
[22:41:54] <matthiasmullie>	 greg-g alright, I'll see if I can join that one tomorrow - thanks
[22:41:59] <greg-g>	 matthiasmullie: what's the need? timeliness?
[22:42:23] <greg-g>	  /urgency
[22:42:35] <matthiasmullie>	 not in a particular rush, I can wait :)
[22:42:40] <greg-g>	 :)
[22:44:15] <wikibugs>	 Operations, ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#3907847 (RobH) Please note I've assigned this task to @faidon to review and approve, since he has been point person on this system's project.  Ideally, once we have furud's existing 2 shelves,...
[22:53:55] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:54:42] <urandom>	 !log bootstrapping restbase1013-b - T184100
[22:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:55] <stashbot>	 T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100
[22:58:55] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[23:38:57] <mutante>	 !log [terbium:~] $ echo 'https://annual.wikimedia.org' | mwscript purgeList.php
[23:39:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:35] <wikibugs>	 (PS1) 20after4: Phabricator: Add translations library to phabricator profile [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225)
[23:48:03] <wikibugs>	 (PS6) Thcipriani: Scap canary: cache last good deploy time [puppet] - https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999)
[23:48:47] <wikibugs>	 (CR) Thcipriani: Scap canary: cache last good deploy time (2 comments) [puppet] - https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: Thcipriani)
[23:50:39] <wikibugs>	 (CR) 20after4: [C: 1] "This can be merged at any time, before or after tonight's phabricator deployment." [puppet] - https://gerrit.wikimedia.org/r/404887 (https://phabricator.wikimedia.org/T225) (owner: 20after4)
[23:58:35] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[23:59:35] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time