[00:41:01] <icinga-wm>	 PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100%
[00:42:29] <icinga-wm>	 RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms
[00:48:39] <wikibugs>	 (PS12) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[00:50:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[01:00:03] <wikibugs>	 (PS13) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[02:22:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32937024 and 1 seconds
[02:24:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3496 and 30 seconds
[03:13:25] <icinga-wm>	 PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:16:45] <wikibugs>	 (PS14) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[03:44:15] <icinga-wm>	 RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:52:07] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Aklapper) >>! In T219589#5072592, @Yann wrote: > Hi, There should be a clear way...
[04:14:27] <icinga-wm>	 PROBLEM - puppet last run on an-worker1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:19:30] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:26:52] <wikibugs>	 (PS15) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:27:44] <icinga-wm>	 PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:32:07] <wikibugs>	 (CR) Andrew Bogott: "> What's the deal with those hosts under 'Hosts that fail to compile" [puppet] - https://gerrit.wikimedia.org/r/499355 (owner: Alex Monk)
[04:32:58] <wikibugs>	 (PS16) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:33:24] <wikibugs>	 (CR) Andrew Bogott: "This is reasonable but I'd prefer to merge things like this dead last, after the final Trusty VMs have been deleted." [puppet] - https://gerrit.wikimedia.org/r/499933 (owner: Muehlenhoff)
[04:36:29] <wikibugs>	 (CR) Andrew Bogott: "I'd like to see longer explanations about what these scripts do, either in code comments or in the command-line usage output.  Seems fine " [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk)
[04:39:23] <wikibugs>	 (PS17) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:39:50] <icinga-wm>	 PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:45:20] <icinga-wm>	 RECOVERY - puppet last run on an-worker1094 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[04:48:40] <icinga-wm>	 RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:01:17] <wikibugs>	 (PS18) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[05:02:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[05:04:10] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/500372
[05:04:12] <wikibugs>	 (PS19) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[05:06:08] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/500372 (owner: Marostegui)
[05:06:10] <icinga-wm>	 RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:06:33] <wikibugs>	 (PS1) Marostegui: mariadb: Remove labsdb1004,labsdb1005 [puppet] - https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749)
[05:07:16] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/500372 (owner: Marostegui)
[05:07:32] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/500372 (owner: Marostegui)
[05:08:27] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 53s)
[05:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:39] <marostegui>	 !log Deploy schema change on db1077, this will generate lag on s3 on labs
[05:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:41] <wikibugs>	 Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:11:10] <wikibugs>	 Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) root@db2070:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337FADD0)      Port Name: 1I     Port Name: 2I     Gen8 ServBP 12+2 at Port 1I,...
[05:12:06] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2070 is CRITICAL: cluster=mysql device=cciss,5 instance=db2070:9100 job=node site=codfw Marostegui T208323 - The acknowledgement expires at: 2019-04-09 05:11:44. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2070&var-datasource=codfw+prometheus/ops
[05:25:02] <wikibugs>	 (PS20) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[05:35:07] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/500374
[05:36:03] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/500374 (owner: Marostegui)
[05:37:23] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/500374 (owner: Marostegui)
[05:38:42] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 (duration: 00m 50s)
[05:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:16] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/500374 (owner: Marostegui)
[05:59:04] <wikibugs>	 Operations, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (Marostegui) This wiki is triggering some false positives on our labs private data checking methods, even if it is correctly sanitized (T212625#5062038) it...
[06:00:36] <wikibugs>	 (PS21) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[06:27:17] <_joe_>	 !log installing new bootstrap-vz on boron T219580
[06:27:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:25] <stashbot>	 T219580: Remove backports from wikimedia-jessie - https://phabricator.wikimedia.org/T219580
[06:28:38] <_joe_>	 !log pushing wikimedia-jessie:{20190401,latest} to docker-registry.w.o T219580
[06:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:46] <icinga-wm>	 PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:32:06] <icinga-wm>	 PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet]
[06:32:16] <icinga-wm>	 PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishospital]
[06:33:20] <icinga-wm>	 PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf]
[06:47:08] <wikibugs>	 (PS1) Giuseppe Lavagetto: Remove spurious depends completely [docker-images/production-images] - https://gerrit.wikimedia.org/r/500379
[06:47:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: Rebuild jessie images for removal of backports, updates [docker-images/production-images] - https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747)
[06:47:20] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[06:54:21] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Peachey88) I would prefer to see someone over prioritize a task so it shows up ea...
[06:55:04] <icinga-wm>	 RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:58:24] <icinga-wm>	 RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:32] <icinga-wm>	 RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:34] <icinga-wm>	 RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:11:57] <wikibugs>	 (PS3) Muehlenhoff: Pull in kibana/logstash 5.6.15 [puppet] - https://gerrit.wikimedia.org/r/500066
[07:13:37] <wikibugs>	 Operations, Tools, cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (Joe)
[07:13:46] <wikibugs>	 Operations, Tools, cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (Joe) p:Triage→High
[07:16:20] <wikibugs>	 Operations, Tools, cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (Joe) a:Joe→None
[07:16:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Remove spurious depends completely [docker-images/production-images] - https://gerrit.wikimedia.org/r/500379 (owner: Giuseppe Lavagetto)
[07:17:08] <wikibugs>	 (PS2) Giuseppe Lavagetto: Remove spurious depends completely [docker-images/production-images] - https://gerrit.wikimedia.org/r/500379
[07:17:13] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Remove spurious depends completely [docker-images/production-images] - https://gerrit.wikimedia.org/r/500379 (owner: Giuseppe Lavagetto)
[07:17:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Pull in kibana/logstash 5.6.15 [puppet] - https://gerrit.wikimedia.org/r/500066 (owner: Muehlenhoff)
[07:19:14] <wikibugs>	 (PS2) Giuseppe Lavagetto: Rebuild jessie images for removal of backports, updates [docker-images/production-images] - https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747)
[07:19:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Rebuild jessie images for removal of backports, updates [docker-images/production-images] - https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747) (owner: Giuseppe Lavagetto)
[07:20:57] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/500383
[07:22:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500384
[07:23:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: Edit Project Config [docker-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500385
[07:24:53] <wikibugs>	 (PS1) Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500386
[07:26:10] <wikibugs>	 (CR) Marostegui: [C: +2] db-codfw.php: Depool db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/500383 (owner: Marostegui)
[07:27:18] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/500383 (owner: Marostegui)
[07:27:50] <wikibugs>	 (PS2) Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500384
[07:28:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Edit Project Config [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500384 (owner: Giuseppe Lavagetto)
[07:29:28] <icinga-wm>	 RECOVERY - MariaDB disk space on dbstore1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:29:46] <wikibugs>	 Operations, Toolforge, Patch-For-Review, cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (MoritzMuehlenhoff)
[07:29:49] <wikibugs>	 Operations, Dumps-Generation: Switch dumps to component/php7.2 - https://phabricator.wikimedia.org/T218193 (MoritzMuehlenhoff) Open→Resolved This is done
[07:30:13] <wikibugs>	 (CR) DCausse: [C: +1] Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: Gehel)
[07:30:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2033 (duration: 00m 51s)
[07:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:45] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/500383 (owner: Marostegui)
[07:39:19] <wikibugs>	 (CR) Mathew.onipe: [C: +1] Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: Gehel)
[07:42:11] <wikibugs>	 (PS1) Muehlenhoff: toolforge: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/500388
[07:44:26] <wikibugs>	 (PS17) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[07:44:28] <wikibugs>	 (PS3) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705)
[07:44:30] <wikibugs>	 (PS4) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705)
[07:44:32] <wikibugs>	 (PS11) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[07:44:34] <wikibugs>	 (PS6) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[07:44:36] <wikibugs>	 (PS9) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[07:44:38] <wikibugs>	 (PS3) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[07:52:33] <wikibugs>	 (CR) ArielGlenn: [C: +2] use MediaWiki maintenance script to get db user and password [dumps] - https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) (owner: ArielGlenn)
[07:54:04] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@7abb6c8]: get db user/passwd via mw maint script
[07:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:07] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@7abb6c8]: get db user/passwd via mw maint script (duration: 00m 03s)
[07:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:38] <wikibugs>	 (PS4) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705)
[07:58:40] <wikibugs>	 (PS5) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705)
[07:58:42] <wikibugs>	 (PS12) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[07:58:45] <wikibugs>	 (PS7) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[07:58:47] <wikibugs>	 (PS10) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[07:58:49] <wikibugs>	 (PS4) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[08:07:00] <wikibugs>	 (PS18) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:07:01] <wikibugs>	 (PS5) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705)
[08:07:03] <wikibugs>	 (PS6) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705)
[08:07:05] <wikibugs>	 (PS13) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[08:07:07] <wikibugs>	 (PS8) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[08:07:09] <wikibugs>	 (PS11) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[08:07:11] <wikibugs>	 (PS5) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[08:09:05] <marostegui>	 !log Deploy testing schema change on enwiki.echo_event on db2033 and upgrade mysql - T143961
[08:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:09] <stashbot>	 T143961: Add index on event_page_id on echo_event table - https://phabricator.wikimedia.org/T143961
[08:17:08] <wikibugs>	 (PS1) Marostegui: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - https://gerrit.wikimedia.org/r/500392
[08:22:13] <wikibugs>	 (PS1) Rxy: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285)
[08:25:26] <rxy>	 jouncebot: next
[08:25:27] <jouncebot>	 In 2 hour(s) and 4 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030)
[08:29:18] <wikibugs>	 (CR) Ema: [C: +1] "The smallest nit." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:30:47] <wikibugs>	 (CR) Gehel: "LGTM, waiting for https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/499951 to be deployed first" [puppet] - https://gerrit.wikimedia.org/r/500359 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev)
[08:34:29] <wikibugs>	 (CR) Volans: [C: +2] check_icinga: fix retry logic [software/external-monitoring] - https://gerrit.wikimedia.org/r/499368 (owner: Volans)
[08:34:42] <icinga-wm>	 PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[08:35:11] <wikibugs>	 (Merged) jenkins-bot: check_icinga: fix retry logic [software/external-monitoring] - https://gerrit.wikimedia.org/r/499368 (owner: Volans)
[08:35:52] <icinga-wm>	 RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 79454 bytes in 3.246 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:36:59] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-codfw.php: Depool db2033" [mediawiki-config] - https://gerrit.wikimedia.org/r/500392 (owner: Marostegui)
[08:38:11] <wikibugs>	 (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - https://gerrit.wikimedia.org/r/500392 (owner: Marostegui)
[08:40:00] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2033 (duration: 00m 51s)
[08:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:20] <wikibugs>	 (PS19) Vgutierrez: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:40:22] <wikibugs>	 (PS6) Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705)
[08:40:24] <wikibugs>	 (PS7) Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705)
[08:40:26] <wikibugs>	 (PS14) Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[08:40:28] <wikibugs>	 (PS9) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[08:40:30] <wikibugs>	 (PS12) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[08:40:32] <wikibugs>	 (PS6) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[08:40:34] <wikibugs>	 (CR) Vgutierrez: Allow acme-chief to provide unified cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:42:08] <wikibugs>	 (CR) Ema: [C: +1] Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:46:30] <wikibugs>	 (CR) Ema: [C: +1] acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[08:46:38] <wikibugs>	 (CR) Vgutierrez: [C: +2] Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:49:15] <wikibugs>	 (CR) jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - https://gerrit.wikimedia.org/r/500392 (owner: Marostegui)
[08:53:49] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[08:55:47] <wikibugs>	 (CR) Ema: nagios_common: provide check_ssl_unified variants for LE certs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[08:58:42] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:00:35] <wikibugs>	 (CR) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:01:09] <wikibugs>	 (PS8) Vgutierrez: hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705)
[09:01:11] <wikibugs>	 (PS15) Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[09:01:14] <wikibugs>	 (PS10) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[09:01:16] <wikibugs>	 (PS13) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[09:01:18] <wikibugs>	 (PS7) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[09:02:32] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:06:54] <wikibugs>	 (CR) Ema: cache: serve wikiba.se traffic using cache::canary servers (3 comments) [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:07:15] <wikibugs>	 (CR) Ema: [C: +1] hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:07:49] <wikibugs>	 (CR) Vgutierrez: [C: +2] hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:09:27] <moritzm>	 !log installing Chromium security updates on proton* (tested the new release in deployment-prep)
[09:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:29] <wikibugs>	 (PS5) Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[09:14:38] <icinga-wm>	 PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:17:31] <vgutierrez>	 ^^ that's me
[09:21:15] <Krenair>	 vgutierrez, am looking at this on deployment-prep, is there an nginx vs. update-ocsp catch-22?
[09:21:23] <vgutierrez>	 yes
[09:21:40] <vgutierrez>	 the nginx service unit is preventing writes to /etc
[09:21:56] <vgutierrez>	 so the update-ocsp-all script triggered by the nginx systemd service unit fails
[09:22:14] <vgutierrez>	 I'm considering the options to fix it
[09:22:19] <vgutierrez>	 sorry about the noise
[09:28:59] <wikibugs>	 Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (Krenair)
[09:31:07] <wikibugs>	 Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (Krenair) This was on deployment-sca02 but the list of deployment-prep instances failing puppet grew suddenly, so I expect a few of these were affected by this problem, and I imagine it's not...
[09:32:19] <wikibugs>	 (PS16) Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[09:32:21] <wikibugs>	 (PS11) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[09:32:23] <wikibugs>	 (PS14) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[09:32:25] <wikibugs>	 (PS8) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[09:32:27] <wikibugs>	 (PS1) Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705)
[09:33:45] <icinga-wm>	 RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational
[09:34:19] <wikibugs>	 Operations, Tools, cloud-services-team (Kanban): Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (aborrero)
[09:36:58] <wikibugs>	 (CR) Gehel: Elasticsearch: make unfreezing writes more robust. (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: Gehel)
[09:37:36] <wikibugs>	 (PS2) Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705)
[09:37:38] <wikibugs>	 (PS17) Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[09:37:41] <wikibugs>	 (PS12) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[09:37:42] <wikibugs>	 (PS15) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[09:37:45] <wikibugs>	 (PS9) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[09:38:45] <wikibugs>	 (CR) Ema: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:39:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "Please reference T219362 in the commit message. Other than that, LGTM." [puppet] - https://gerrit.wikimedia.org/r/500388 (owner: Muehlenhoff)
[09:40:13] <wikibugs>	 (PS3) Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705)
[09:40:15] <wikibugs>	 (PS18) Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[09:40:17] <wikibugs>	 (PS13) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[09:40:19] <wikibugs>	 (03PS16) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[09:40:21] <wikibugs>	 (03PS10) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[09:40:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[09:40:29] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793
[09:40:41] <wikibugs>	 (03CR) 10Ema: [C: 03+1] tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[09:40:46] <Krenair>	 I am having some trouble with the update-ocsp stuff vgutierrez 
[09:40:52] <wikibugs>	 (03PS2) 10Muehlenhoff: toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362)
[09:41:07] <Krenair>	 oh wait this may be my fault
[09:41:17] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[09:41:33] <Krenair>	 sort of
[09:42:09] <vgutierrez>	 merging that ^^ right now
[09:42:18] <vgutierrez>	 that should do the trick
[09:42:24] <vgutierrez>	 (tested manually in cp1008 and worked like a charm)
[09:42:58] <Krenair>	 Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-live-ec-prime256v1-create-ocsp]/returns: Exception: Command openssl ocsp -resp_text -respout /etc/acmecerts/unified/live/update-ocsp-fdhzvx.tmp/ec-prime256v1.client.ocsp -issuer /etc/ssl/certs/4f06f81d.0 -verify_other /etc/ssl/certs/4f06f81d.0 -url http://ocsp.int-x3.letsencrypt.org -header Host ocsp.int-x3.letsencrypt.org 
[09:42:59] <Krenair>	 -cert /etc/acmecerts/unified/live/ec-prime256v1.crt failed with exit code 1, stderr:
[09:42:59] <Krenair>	 Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-live-ec-prime256v1-create-ocsp]/returns: Missing = in header key=value
[09:44:49] <vgutierrez>	 hmmm, $proxy rendering issues on the template?
[09:45:08] <Krenair>	 vgutierrez, well first of all it tried to hardcode the prod proxy
[09:45:22] <Krenair>	 then I tried removing the proxy line, not much happier
[09:45:26] <Krenair>	 then I tried setting it to empty string
[09:45:50] <Krenair>	 but actually, read the command it's complaining about closely
[09:45:55] <Krenair>	 -header Host ocsp.int-x3.letsencrypt.org 
[09:46:00] <Krenair>	 Missing = in header key=value
[09:46:15] <Krenair>	 isn't openssl ocsp complaining about update-ocsp passing it a -header without = ?
[09:47:21] <vgutierrez>	 Krenair: which version of openssl is that instance running?
[09:47:35] <vgutierrez>	 cause that's working as expected in cp1008 as we speak
[09:47:56] <Krenair>	 1.1.0j
[09:47:58] <Krenair>	 what do you have?
[09:48:03] <vgutierrez>	 1.0.2r
[09:48:14] <Krenair>	 specifically I've got 1.1.0j-1~deb9u1 out of stretch/main
[09:48:46] <vgutierrez>	 cp1008 is still running jessie
[09:48:50] <Krenair>	 ah
[09:49:01] <Krenair>	 that might be something to watch out for in prod
[09:49:09] <Krenair>	 don't those other cp hosts run stretch?
[09:49:21] <vgutierrez>	 you're right
[09:49:24] <Krenair>	 fun
[09:49:31] <Krenair>	 okay
[09:49:34] <vgutierrez>	 they run stretch, and ocsp stapling is currently working as expected there
[09:49:40] <Krenair>	 ... huh.
[09:49:42] <vgutierrez>	 for the current unified cert
[09:49:50] <vgutierrez>	 (not the acme-chief managed one)
[09:49:53] <Krenair>	 ah
[09:50:15] <vgutierrez>	 so that part should be exactly the same
[09:50:44] <Krenair>	 ah it looks like header is only passed if you don't have a proxy
[09:51:00] <vgutierrez>	 ok.. so it's an existing bug there
[09:51:04] <Krenair>	 yeah
[09:51:11] <Krenair>	 one sec
[09:51:14] <vgutierrez>	 awesome, let's check that code :)
[09:52:53] <Krenair>	 yeah that did the trick
[09:53:43] <wikibugs>	 (03PS1) 10Alex Monk: sslcert: update-ocsp: Fix passing Host header in absence of proxy [puppet] - 10https://gerrit.wikimedia.org/r/500398
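The exchange above comes down to an argument-format difference: the jessie host (OpenSSL 1.0.2r) accepts `-header Host <value>`, while the stretch instance (OpenSSL 1.1.0j) rejects it with "Missing = in header key=value". A minimal Python sketch of that difference follows. It is an illustration only, not the content of change 500398; the helper name `ocsp_header_args` and the version parsing are hypothetical.

```python
import re


def ocsp_header_args(host, openssl_version):
    """Return the `openssl ocsp` arguments that set the Host header.

    Hypothetical helper, not taken from update-ocsp. Assumes version
    strings shaped like "1.0.2r" or "1.1.0j-1~deb9u1".
    """
    match = re.match(r"(\d+)\.(\d+)", openssl_version)
    if match is None:
        raise ValueError("unrecognised OpenSSL version: %r" % openssl_version)
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor >= (1, 1):
        # 1.1.0 and later expect a single name=value argument; the
        # two-argument form fails with "Missing = in header key=value".
        return ["-header", "Host=" + host]
    # 1.0.x takes the header name and its value as separate arguments.
    return ["-header", "Host", host]


if __name__ == "__main__":
    for version in ("1.0.2r", "1.1.0j-1~deb9u1"):
        print(version, ocsp_header_args("ocsp.int-x3.letsencrypt.org", version))
```

Branching on the version keeps both distributions working while hosts are still split between jessie and stretch; whether the merged fix does this or simply switches formats is not stated in the log.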
[09:54:32] <wikibugs>	 (03PS1) 10Muehlenhoff: base: Remove support for trusty/Ubuntu in multiple places [puppet] - 10https://gerrit.wikimedia.org/r/500400
[09:54:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500398 (owner: 10Alex Monk)
[10:04:06] <wikibugs>	 (03CR) 10Elukey: admin: allow users to be removed preserving their home directories (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey)
[10:04:35] <wikibugs>	 (03PS1) 10Muehlenhoff: base/ntp: Remove trusty/Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/500403
[10:04:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) (owner: 10Muehlenhoff)
[10:07:07] <wikibugs>	 (03PS1) 10Muehlenhoff: kube-proxy: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500404
[10:09:27] <wikibugs>	 (03PS1) 10ArielGlenn: get rid of references to deprecated MW maintenance script [dumps] - 10https://gerrit.wikimedia.org/r/500405
[10:11:28] <wikibugs>	 (03Abandoned) 10Jbond: jessie-backports: remove updates from jessie bootstrap-vz config [puppet] - 10https://gerrit.wikimedia.org/r/500069 (https://phabricator.wikimedia.org/T219580) (owner: 10Jbond)
[10:11:59] <Lucas_WMDE>	 jouncebot: next
[10:12:00] <jouncebot>	 In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030)
[10:13:17] <wikibugs>	 (03PS4) 10Elukey: admin: allow users to be removed preserving their home directories [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171)
[10:17:49] <wikibugs>	 (03PS1) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406
[10:18:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/500407
[10:19:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk)
[10:22:34] <wikibugs>	 (03PS2) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406
[10:24:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk)
[10:25:53] <wikibugs>	 (03CR) 10Jbond: admin: allow users to be removed preserving their home directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey)
[10:27:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Upgrade to 1.1.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500408
[10:27:21] <arturo>	 !log T219626 reimaging cloudcontrol2001-dev
[10:27:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] <stashbot>	 T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[10:28:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Upgrade to 1.1.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500408 (owner: 10Giuseppe Lavagetto)
[10:30:04] <jouncebot>	 jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030).
[10:30:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/500409
[10:31:17] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7ef5ca3]: Upgrade to 1.1.2
[10:31:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:23] <wikibugs>	 (03PS3) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406
[10:31:43] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7ef5ca3]: Upgrade to 1.1.2 (duration: 00m 26s)
[10:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:51] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546)
[10:32:59] <_joe_>	 !log pruning old images on boron
[10:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411
[10:34:00] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ladsgroup) >>! In T218155#5071692, @MarcoAurelio wrote: > Per docs, ops must be notified. I've been doing so emailing the ops list every time....
[10:35:23] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[10:35:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:38:14] <wikibugs>	 (03PS1) 10Muehlenhoff: dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413
[10:38:58] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:39:11] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[10:39:27] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:39:58] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:42:23] <jbond42>	 !log rolling security update of tshark
[10:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:46] <Krenair>	 vgutierrez, fyi I found https://phabricator.wikimedia.org/T182927#5073598 to be necessary to start serving traffic using the new cert
[10:43:53] <Krenair>	 was there an easier way I missed?
[10:45:23] <wikibugs>	 (03PS1) 10Muehlenhoff: redis::instance: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/500415
[10:45:45] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:46:19] <vgutierrez>	 Krenair: right, in prod we aren't there yet, that's why it isn't provided in the puppetization
[10:46:19] <wikibugs>	 (03CR) 10Alex Monk: "These designate/gdnsd sync scripts are what acme_chief use to set DNS TXT records up. In gdnsd's case, it will expire challenges for us - " [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk)
[10:46:31] <wikibugs>	 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775 (10aborrero)
[10:47:02] <Krenair>	 vgutierrez, I imagine when moving prod to use these certs we wouldn't want to roll it out everywhere at the same time... what mechanism should be added to permit prod rollout?
[10:47:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:47:14] <Krenair>	 I imagine I could use the exact same mechanism to enable it in deployment-prep without this nasty hack
[10:47:30] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:500410| Bumping portals to master (T128546)]] (duration: 00m 52s)
[10:47:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:33] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:47:35] <vgutierrez>	 yes, basically moving that to hieradata
[10:47:44] <Krenair>	 so like
[10:48:14] <wikibugs>	 (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:48:17] <Krenair>	 serve_acme_chief_certs = false
[10:48:21] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:500410| Bumping portals to master (T128546)]] (duration: 00m 50s)
[10:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:41] <Krenair>	 which we can set to true in specific cases, but only set certs/certs_active if it's false vgutierrez ?
[10:49:32] <arturo>	 volans: just opened T219775
[10:49:32] <stashbot>	 T219775: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775
[10:50:05] <vgutierrez>	 hmm not right now, cause we need to be able to deliver both "certs" and acme_chief managed certificates to the same server
[10:50:33] <Krenair>	 will need some more complicated logic in localssl.erb then
[10:51:07] <Krenair>	 maybe replacing @certs_nginx.empty? with a check for serve_acme_chief_certs
[10:51:22] <Krenair>	 so not more complicated really
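The idea floated above is a per-host flag that decides which certificates are served, while both the classic "certs" and the acme-chief managed ones stay deployed on the same server. A minimal Python sketch of that selection logic, as an assumption about the intent rather than the actual localssl.erb template; the function name `certs_to_serve` and the example certificate names are hypothetical, and only `serve_acme_chief_certs` comes from the discussion.

```python
def certs_to_serve(certs, acme_chief_certs, serve_acme_chief_certs=False):
    """Pick which certificate names the TLS terminator should serve.

    Hypothetical sketch: both lists are assumed to be deployed on the
    host already, and the flag (for example, set per host in hieradata)
    only chooses between them.
    """
    if serve_acme_chief_certs:
        return list(acme_chief_certs)
    return list(certs)


if __name__ == "__main__":
    # Default: keep serving the existing certificates.
    print(certs_to_serve(["unified"], ["acme-unified"]))
    # Opted-in host: serve the acme-chief managed certificates instead.
    print(certs_to_serve(["unified"], ["acme-unified"], serve_acme_chief_certs=True))
```

Defaulting the flag to false would let a rollout proceed host by host, which matches the concern about not switching everything at once.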
[10:51:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/500407 (owner: 10Muehlenhoff)
[10:54:29] <wikibugs>	 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775 (10Volans) p:05Triage→03Normal a:03Volans It's kinda expected, the line when it says: ` Scheduled delayed downtime on Icinga ` spawn a subp...
[10:56:50] <wikibugs>	 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10aborrero)
[10:57:51] <wikibugs>	 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10aborrero) Regarding the `puppet_first_run` error, it would be interesting if a bit more information or context is provided by the script. W...
[10:59:01] <wikibugs>	 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10Volans) It's all in the logs mentioned at the start of the script: ` Could not retrieve catalog from remote server: Error 500 on SERVER: Se...
[10:59:05] <icinga-wm>	 PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark]
[10:59:26] <jbond42>	 ^^ looking
[11:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1100).
[11:00:04] <jouncebot>	 Daimona, odder, Lucas_WMDE, and rxy: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] <rxy>	 o/
[11:00:13] <Daimona>	 o/
[11:00:21] <jbond42>	 puppet runs fine, suspect this is just because there was a lock on the dpkg db
[11:00:47] <zeljkof>	 I can SWAT today
[11:00:52] <jbond42>	 !log halt rolling updates of tshark until after SWAT
[11:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:05] <Lucas_WMDE>	 o/
[11:01:15] <jbond42>	 also in relation to the error above puppet runs fine, suspect this is just because there was a lock on the dpkg db
[11:01:52] <moritzm>	 jbond42: tshark is ensured via "package/present" in puppet, this kind of puppet spam sometimes happens if the package is being upgraded with debdeploy and then puppet tries to also install it
[11:01:54] <Lucas_WMDE>	 zeljkof: is it okay if I already +2 my backport? it’ll probably take a while to go through CI
[11:02:05] <Lucas_WMDE>	 looks like everything else is config changes, those can be deployed first
[11:02:21] <jbond42>	 moritzm: thanks that was my suspicion
[11:02:43] <zeljkof>	 Lucas_WMDE: sure
[11:02:48] <Lucas_WMDE>	 Daimona: congrats on Gerrit change #500000 btw :)
[11:03:19] <Daimona>	 Lucas_WMDE: Thanks, an important milestone for all of us :)
[11:03:30] <zeljkof>	 500k?*
[11:04:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:04:21] <icinga-wm>	 RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:04:53] <Lucas_WMDE>	 btw Daimona you seem to have the same patch twice in the Deployments calendar
[11:04:53] <icinga-wm>	 PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark]
[11:05:11] <Lucas_WMDE>	 IIRC there were two config var removals, so I guess that should be two different patches (instead of just removing the duplicate)?
[11:05:34] <zeljkof>	 odder: around for swat?
[11:05:41] <Daimona>	 Ah yes Lucas, thanks
[11:05:43] <Daimona>	 Fixing right now
[11:05:58] <Lucas_WMDE>	 fortunately the hashtag makes it easy to find the other one :)
[11:06:37] <wikibugs>	 (03CR) 10Zfilipin: "This is scheduled for EU SWAT but will not be deployed unless the developer joins #wikimedia-operations." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder)
[11:06:55] <Daimona>	 Yay, I like this new hashtag thingy
[11:08:19] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:08:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417
[11:09:26] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:10:12] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: bootstrap hiera keys for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/500418 (https://phabricator.wikimedia.org/T219626)
[11:10:21] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:11:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto)
[11:11:15] <zeljkof>	 Daimona: 498817 is at mwdebug1002, please test and let me know if I can deploy it
[11:11:20] <wikibugs>	 (03PS1) 10ArielGlenn: use yaml safe_load everywhere [dumps] - 10https://gerrit.wikimedia.org/r/500419
[11:11:21] <Daimona>	 Testing
[11:11:36] <wikibugs>	 (03CR) 10jenkins-bot: Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto)
[11:12:13] <Daimona>	 Ahhhh today debug servers are slow I see
[11:13:08] <Daimona>	 It'll take a while to check
[11:13:11] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:13:15] <icinga-wm>	 PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:13:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Upgrade to 1.1.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500420
[11:14:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Upgrade to 1.1.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500420 (owner: 10Giuseppe Lavagetto)
[11:15:38] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@0c32dc1]: Upgrade to 1.1.2
[11:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:41] <icinga-wm>	 RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 79653 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[11:16:46] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@0c32dc1]: Upgrade to 1.1.2 (duration: 01m 08s)
[11:16:47] <Daimona>	 zeljkof looks good, please go ahead
[11:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:53] <zeljkof>	 Daimona: ok, deploying
[11:16:56] <Daimona>	 Thanks
[11:17:38] <wikibugs>	 (03PS1) 10Volans: wmf-auto-reimage: fix Icinga delayed downtime [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775)
[11:17:54] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:498817|Revert "Revert "Remove $wgAbuseFilterProfile"" (T191039)]] (duration: 00m 52s)
[11:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:57] <stashbot>	 T191039: Re-enable filter profiling on every wiki - https://phabricator.wikimedia.org/T191039
[11:18:01] <zeljkof>	 Daimona: deployed
[11:18:13] <Daimona>	 Nice
[11:18:40] <wikibugs>	 (03PS3) 10Zfilipin: Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:19:42] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:19:44] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 03+1] Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE))
[11:20:13] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] get rid of references to deprecated MW maintenance script [dumps] - 10https://gerrit.wikimedia.org/r/500405 (owner: 10ArielGlenn)
[11:20:50] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:20:58] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] use yaml safe_load everywhere [dumps] - 10https://gerrit.wikimedia.org/r/500419 (owner: 10ArielGlenn)
[11:21:11] <zeljkof>	 Daimona: is there a reason 498817 and 498818 are not in the same patch?
[11:21:20] <wikibugs>	 (03PS2) 10Rxy: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285)
[11:21:26] <Daimona>	 zeljkof: I think that's because they depended-on different patches in AF
[11:21:33] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy)
[11:21:34] <zeljkof>	  ah, ok
[11:21:34] <Daimona>	 And AF had two different patches for review ease
[11:21:43] <Daimona>	 Although they were merged together
[11:21:50] <zeljkof>	 Daimona: 498818 is at mwdebug
[11:21:55] <Daimona>	 Testing
[11:22:26] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] openstack: bootstrap hiera keys for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/500418 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez)
[11:23:50] <Daimona>	 This one looks good, too
[11:24:13] <zeljkof>	 Daimona: ok, deploying
[11:25:11] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:498818|Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" (T191039)]] (duration: 00m 51s)
[11:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:15] <stashbot>	 T191039: Re-enable filter profiling on every wiki - https://phabricator.wikimedia.org/T191039
[11:25:19] <zeljkof>	 Daimona: deployed
[11:25:31] <aharoni>	 hi zeljkof 
[11:25:40] <wikibugs>	 (03PS3) 10Zfilipin: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad)
[11:25:43] <zeljkof>	 hi aharoni!
[11:25:53] <aharoni>	 zeljkof: are you OK with me testing the deployment of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/499210/ ?
[11:26:00] <Daimona>	 Cool, keeping an eye on it
[11:26:07] <aharoni>	 Great, so I'm around
[11:26:14] <zeljkof>	 aharoni: sure
[11:26:26] <zeljkof>	 aharoni: I'll ping you after I'm done with current deployment
[11:26:30] <aharoni>	 great :)
[11:27:55] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad)
[11:29:13] <wikibugs>	 (03CR) 10Amire80: [C: 03+1] "I'll do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder)
[11:29:26] <wikibugs>	 (03Merged) 10jenkins-bot: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad)
[11:30:24] <zeljkof>	 Daimona: 497236 is at mwdebug
[11:30:43] <Lucas_WMDE>	 CI for my backport is still running btw o_O
[11:30:47] <Daimona>	 zeljkof Thanks, I'm trying to figure out how to test it
[11:30:59] <Daimona>	 I'm unsure if there's a direct way
[11:31:11] <icinga-wm>	 RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:31:27] <Daimona>	 I don't know where udp notifications are sent for commons
[11:31:46] <zeljkof>	 Daimona: should I deploy it?
[11:31:51] <icinga-wm>	 PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:31:54] <zeljkof>	 Lucas_WMDE: I still have a few more patches
[11:32:00] <Daimona>	 I'm checking logstash just to be sure no fatals etc.
[11:32:22] <Lucas_WMDE>	 zeljkof: go ahead with rxy and aharoni then :)
[11:32:30] <Lucas_WMDE>	 (btw there seems to be a parser error in fatalmonitor?)
[11:32:33] <zeljkof>	 Lucas_WMDE: ok
[11:32:45] <wikibugs>	 (03CR) 10jenkins-bot: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad)
[11:32:49] <zeljkof>	 Lucas_WMDE: which error?
[11:32:55] <aharoni>	 [ I'm around ]
[11:32:58] <Lucas_WMDE>	 “start tag expected, '<' not found”
[11:33:06] <wikibugs>	 (03PS3) 10ArielGlenn: explicitly start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160)
[11:33:07] <Daimona>	 zeljkof Everything seems quiet, cleared for deployment I guess
[11:33:09] <Lucas_WMDE>	 checking logstash rn
[11:33:19] <zeljkof>	 Daimona: ok, deploying
[11:34:09] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:34:19] <Zppix>	 jouncebot: now
[11:34:19] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1100)
[11:34:25] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:497236|Enable logging of private filters on commonswiki (T218527)]] (duration: 00m 50s)
[11:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:33] <stashbot>	 T218527: Enable wgAbuseFilterNotificationsPrivate on commons.wikimedia.org - https://phabricator.wikimedia.org/T218527
[11:34:44] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601)
[11:35:09] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:35:29] <zeljkof>	 Daimona: deployed, thanks for deploying with #releng and please check production and logs :)
[11:35:30] <wikibugs>	 10Operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632 (10ArielGlenn) I usually just grumble to myself and move on when I forget and try to push to refs/for/master but today I reached MAX_GRUMBLES. How much work would this be?
[11:35:45] <Daimona>	 zeljkof: Noooice, thank you :-)
[11:35:59] <Lucas_WMDE>	 zeljkof: looks like the last such error in mwlog1001:/srv/mw-log/hhvm.log was at 9:51:20, so ignore it I guess
[11:36:04] <wikibugs>	 (03PS3) 10Gehel: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640)
[11:36:18] <Lucas_WMDE>	 it’s just still in the last 1000 lines of hhvm.log, that’s why it appears in the `fatalmonitor` command
[11:36:18] <zeljkof>	 Lucas_WMDE: ok, thanks
[11:36:41] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestnet2003: rename to cloudnet2003-dev and reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/500423 (https://phabricator.wikimedia.org/T219776)
[11:37:18] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder)
[11:37:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2003: rename to cloudnet2003-dev and reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/500423 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[11:37:43] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:37:57] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:38:22] <Lucas_WMDE>	 afk for a few minutes
[11:38:23] <wikibugs>	 (03Merged) 10jenkins-bot: Correct logos for the Gujarati Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder)
[11:38:41] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:39:04] <zeljkof>	 aharoni: 499210 is at mwdebug1002, please test and let me know if I can deploy it
[11:40:15] <aharoni>	 zeljkof: tested on mwdebug1002, looks good, please go on deploying.
[11:40:25] <zeljkof>	 aharoni: ok, deploying
[11:41:27] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:499210|Correct logos for the Gujarati Wikipedia (T219373)]] (duration: 00m 52s)
[11:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:30] <stashbot>	 T219373: Correct logo for the Gujarati Wikipedia - https://phabricator.wikimedia.org/T219373
[11:41:41] <zeljkof>	 aharoni: it's deployed, purging logos
[11:42:11] <zeljkof>	 aharoni: purged, please test, and thanks for deploying with #releng :)
[11:42:35] <Zppix>	 zeljkof: sounds like a sales transaction :P
[11:42:53] <zeljkof>	 rxy: please stand by, you're next
[11:42:59] <rxy>	 k
[11:43:01] <zeljkof>	 Zppix: :D
[11:43:06] <aharoni>	 zeljkof: works in production. Thanks!
[11:43:21] <Zppix>	 Alright my jokes are over, good luck with SWAT :)
[11:43:26] <wikibugs>	 (03CR) 10jenkins-bot: Correct logos for the Gujarati Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder)
[11:43:57] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel)
[11:44:11] <zeljkof>	 aharoni: great!
[11:44:18] <zeljkof>	 Zppix: thanks :)
[11:44:29] <Zppix>	 Np
[11:44:39] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776)
[11:44:54] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: 10Rxy)
[11:44:59] <Lucas_WMDE>	 my backport finally went through CI, hallelujah
[11:44:59] <Lucas_WMDE>	 only took ¾ of the SWAT window :D
[11:46:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[11:46:21] <zeljkof>	 Lucas_WMDE: that is pretty long :/
[11:46:21] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] elasticsearch: add profile for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[11:46:21] <zeljkof>	 I'm deploying my final patch, you're next :)
[11:46:21] <zeljkof>	 Lucas_WMDE: ^
[11:46:28] <wikibugs>	 (03Merged) 10jenkins-bot: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: 10Rxy)
[11:46:28] <Zppix>	 I thought the SWAT pipeline was supposed to speed CI up for SWAT?
[11:46:29] <Lucas_WMDE>	 I’ll wait for the config change to be done
[11:46:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Will workaround the jenkins -1 because is totally expected:" [dns] - 10https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[11:46:31] <Lucas_WMDE>	 ack
[11:46:52] <Lucas_WMDE>	 Zppix: I don’t think there was any contention waiting for a container or something
[11:46:56] <zeljkof>	 Zppix: some tests take a long time to run
[11:46:59] <Lucas_WMDE>	 it just takes a really long time to run the tests
[11:47:14] <Zppix>	 Hmm, I wonder if there's any way to speed that test up
[11:47:15] <Lucas_WMDE>	 (IIRC WikibaseLexeme pulls in a bunch of other extensions as well, to avoid breaking them?)
[11:47:36] <Zppix>	 (I may take a look with the very, very little knowledge I have of CI)
[11:47:42] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: 10Gehel)
[11:47:58] <zeljkof>	 Zppix: there are many things we could do... but there is little time... :)
[11:48:20] <Zppix>	 zeljkof: true, that's why I said I'll see (although I know little about CI)
[11:48:32] <zeljkof>	 rxy: 500393 is at mwdebug1002, please test and let me know if I can deploy it
[11:49:48] <rxy>	 zeljkof: ok, it works as expected. Please deploy to prod
[11:49:52] <Lucas_WMDE>	 hm, https://gerrit.wikimedia.org/r/441178 was just abandoned and looks like it could have helped?
[11:49:55] <Lucas_WMDE>	 (with the test time)
[11:50:02] <zeljkof>	 rxy: ok, deploying
[11:51:00] <logmsgbot>	 !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:500393|Add unwatchedpages permission to rollbacker and patroller at zhwiki (T219285)]] (duration: 00m 52s)
[11:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:09] <stashbot>	 T219285: Add "unwatchedpages" to patrollers and rollbackers on zhwiki - https://phabricator.wikimedia.org/T219285
[11:51:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500426
[11:51:32] <zeljkof>	 rxy: it's deployed, please test in production and thanks for deploying with #releng :)
[11:51:38] <zeljkof>	 Lucas_WMDE: the swat is yours
[11:51:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500427
[11:51:43] <Lucas_WMDE>	 ok thanks, going ahead
[11:52:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500427 (owner: 10Arturo Borrero Gonzalez)
[11:52:29] <wikibugs>	 (03PS1) 10Jbond: facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428
[11:52:53] <moritzm>	 !log uploaded logstash/kibana/elasticsearch to component thirdparty/elastic56
[11:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:01] <Lucas_WMDE>	 backport is on mwdebug1002, testing
[11:53:07] <rxy>	 zeljkof: ok, it works correctly in prod (server: mw1268.eqiad.wmnet)
[11:53:11] <rxy>	 thanks :)
[11:53:29] <moritzm>	 !log uploaded logstash/kibana/elasticsearch 5.6.15 to component thirdparty/elastic56
[11:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:32] <wikibugs>	 (03CR) 10jenkins-bot: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: 10Rxy)
[11:54:48] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "Good enough for me, since we're going to replace this with a cookbook." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: 10Volans)
[11:55:20] <Lucas_WMDE>	 hm, the debug server seems to time out when POSTing an edit
[11:55:32] <Lucas_WMDE>	 but the fix is really for the frontend UI, and that part seems to be working as expected
[11:55:39] <Lucas_WMDE>	 so I’m going to chalk that error up to the debug server being slow
[11:55:45] <Lucas_WMDE>	 and will go ahead with deploying the backport
[11:57:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/WikibaseLexeme: SWAT: [[gerrit:500237|Fix GrammaticalFeatureListWidget (T219134, T219734)]] (duration: 01m 00s)
[11:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:59] <stashbot>	 T219734: Not possible to edit forms of Wikidata lexemes - https://phabricator.wikimedia.org/T219734
[11:58:00] <stashbot>	 T219134: `text.trim is not a function` in node selenium tests blocking Wikibase CI; blocking test/merge in WB/WBL/WBMI/etc. - https://phabricator.wikimedia.org/T219134
[11:58:17] <icinga-wm>	 RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[11:58:18] <Lucas_WMDE>	 !log EU SWAT done
[11:58:19] <wikibugs>	 (03PS1) 10Hashar: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778)
[11:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:22] <Lucas_WMDE>	 just in time, yay
[12:02:10] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@UNKNOWN]: Rollback to 1.0.0, T219778
[12:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:13] <stashbot>	 T219778: docker-pkg is unhappy on contint1001 - https://phabricator.wikimedia.org/T219778
[12:02:44] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@UNKNOWN]: Rollback to 1.0.0, T219778 (duration: 00m 34s)
[12:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:05] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[12:04:25] <wikibugs>	 (03PS4) 10Alaa Sarhan: Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[12:05:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan)
[12:05:46] <wikibugs>	 (03PS2) 10Muehlenhoff: Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE))
[12:08:30] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@0c32dc1]: Rollback to 1.0.0, T219778
[12:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:33] <stashbot>	 T219778: docker-pkg is unhappy on contint1001 - https://phabricator.wikimedia.org/T219778
[12:08:48] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@0c32dc1]: Rollback to 1.0.0, T219778 (duration: 00m 18s)
[12:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE))
[12:12:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] "I've also moved Daniel from the cn=nda group to the cn=wmf LDAP group." [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE))
[12:12:31] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:13:41] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776)
[12:14:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[12:14:10] <wikibugs>	 (03PS11) 10Gehel: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos)
[12:14:37] <wikibugs>	 (03PS5) 10Alaa Sarhan: Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[12:15:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan)
[12:16:32] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos)
[12:18:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432
[12:19:25] <icinga-wm>	 PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:19:26] <wikibugs>	 (03PS6) 10Alaa Sarhan: Add wgMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[12:19:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "Upgrade to 1.1.3", "Upgrade to 1.1.2" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500433
[12:20:22] <wikibugs>	 (03PS2) 10Alaa Sarhan: Add wgScoreLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191)
[12:22:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "Upgrade to 1.1.3", "Upgrade to 1.1.2" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500433 (owner: 10Giuseppe Lavagetto)
[12:23:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: add DB configuration for stretch [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626)
[12:23:31] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@46ba982]: Rollback - third time is the charm
[12:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500428 (owner: 10Jbond)
[12:24:01] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:24:30] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@46ba982]: Rollback - third time is the charm (duration: 00m 43s)
[12:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar)
[12:26:20] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel)
[12:26:40] <wikibugs>	 (03Merged) 10jenkins-bot: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar)
[12:27:08] <wikibugs>	 (03CR) 10jenkins-bot: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar)
[12:27:13] <wikibugs>	 (03CR) 10jenkins-bot: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel)
[12:31:45] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:33:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428 (owner: 10Jbond)
[12:33:53] <wikibugs>	 (03PS1) 10Jbond: pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435
[12:34:00] <wikibugs>	 (03PS2) 10Jbond: facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428
[12:40:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500435 (owner: 10Jbond)
[12:41:53] <icinga-wm>	 RECOVERY - Disk space on dbprov2001 is OK: DISK OK
[12:43:06] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437
[12:43:08] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438
[12:45:47] <icinga-wm>	 RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:46:09] <wikibugs>	 10Operations, 10monitoring, 10Proposal: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158 (10jcrespo) 05Open→03Declined I was convinced, this is desirable, but I don't see a way to move forward that doesn't create many...
[12:47:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[12:47:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel)
[12:47:30] <wikibugs>	 (03PS2) 10Jbond: pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435
[12:51:20] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437
[12:51:22] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438
[12:51:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435 (owner: 10Jbond)
[12:52:11] <wikibugs>	 (03PS2) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601)
[12:52:52] <wikibugs>	 (03CR) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[12:53:57] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626)
[12:55:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez)
[13:00:04] <jouncebot>	 Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) New wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1300).
[13:00:18] <Amir1>	 o/
[13:02:14] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Trivial, LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel)
[13:05:59] <wikibugs>	 (03PS1) 10Volans: tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440
[13:06:24] <wikibugs>	 (03PS4) 10Ladsgroup: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[13:08:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[13:09:10] <wikibugs>	 (03Merged) 10jenkins-bot: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[13:09:14] <jbond42>	 !log rolling security update of tshark
[13:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, couple of very minor nitpicks inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[13:09:23] <wikibugs>	 (03CR) 10jenkins-bot: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio)
[13:09:30] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626)
[13:09:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:11:03] <hashar>	 !log Upgraded CI Jenkins jobs to Quibble 0.0.30 # T219647
[13:11:13] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626)
[13:11:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:19] <stashbot>	 T219647: Upgrade CI jobs to Quibble 0.0.30 - https://phabricator.wikimedia.org/T219647
[13:11:31] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[13:11:37] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437
[13:11:39] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438
[13:12:04] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging]
[13:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:05] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster staging completed
[13:12:05] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:39] <icinga-wm>	 PROBLEM - puppet last run on an-worker1094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:13:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez)
[13:13:29] <wikibugs>	 (03PS19) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[13:13:31] <wikibugs>	 (03PS14) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[13:13:33] <wikibugs>	 (03PS17) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[13:13:35] <wikibugs>	 (03PS11) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[13:13:37] <wikibugs>	 (03PS1) 10Vgutierrez: localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705)
[13:14:19] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:14:22] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: admin_scripts: libvirt-bin doesn't exists in stretch [puppet] - 10https://gerrit.wikimedia.org/r/500444 (https://phabricator.wikimedia.org/T215407)
[13:15:14] <wikibugs>	 (03CR) 10Ema: [C: 03+1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:15:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500432 (owner: 10Muehlenhoff)
[13:15:41] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:15:48] <wikibugs>	 (03CR) 10Alex Monk: "This should probably be put behind the same hiera check we'll be using to determine whether the host is using the cert to serve traffic or" [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:15:55] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445
[13:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:16:33] <icinga-wm>	 PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:16:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:16:45] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:16:59] <icinga-wm>	 PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:17:03] <icinga-wm>	 PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:17:19] <icinga-wm>	 PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:17:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: admin_scripts: libvirt-bin doesn't exists in stretch [puppet] - 10https://gerrit.wikimedia.org/r/500444 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[13:17:27] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:17:43] <icinga-wm>	 PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:17:55] <icinga-wm>	 PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:17:57] <icinga-wm>	 PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:18:05] <icinga-wm>	 PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark]
[13:18:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:18:47] <icinga-wm>	 PROBLEM - puppet last run on sessionstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:18:59] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:19:04] <wikibugs>	 (03CR) 10Alex Monk: "Oh, yeah, restart instead of reload. Let's not do that." [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:20:13] <icinga-wm>	 PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:20:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:20:17] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[13:20:23] <icinga-wm>	 PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools]
[13:20:37] <wikibugs>	 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10MoritzMuehlenhoff) @Krinkle I've prepared a new build and uploaded it to https://...
[13:20:45] <wikibugs>	 (03PS2) 10Vgutierrez: localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705)
[13:20:47] <wikibugs>	 (03PS20) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[13:20:49] <wikibugs>	 (03PS15) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[13:20:51] <wikibugs>	 (03PS18) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[13:20:53] <wikibugs>	 (03PS12) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[13:22:00] <wikibugs>	 (03CR) 10Alex Monk: [C: 03+1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:23:06] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-eqiad-values.yaml stable/citoid [namespace: citoid, clusters: eqiad]
[13:23:07] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster eqiad completed
[13:23:07] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:07] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:21] <icinga-wm>	 RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:26:52] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-codfw-values.yaml stable/citoid [namespace: citoid, clusters: codfw]
[13:26:53] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster codfw completed
[13:26:53] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:15] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:27:32] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778)
[13:28:16] <wikibugs>	 (03PS2) 10Muehlenhoff: Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432
[13:29:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432 (owner: 10Muehlenhoff)
[13:29:43] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:33:35] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:36:43] <wikibugs>	 (03CR) 10Ema: [C: 03+1] hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:38:46] <Amir1>	 marostegui: I looked at lots of options, the only thing I can do for now is to drop hywwiki on s3 and make it again (make sure you don't drop hywiki, with one y, that's a huge wiki)
[13:39:07] <Amir1>	 Is it okay if I drop it?
[13:39:15] <marostegui>	 Amir1: that is not that straightforward, we have to drop it on all the es hosts and x1
[13:39:43] <Reedy>	 What will dropping actually solve?
[13:40:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[13:40:56] <Amir1>	 marostegui: not es or x1, the maintenance script can skip re-adding those clusters now
[13:41:19] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:41:22] <Krenair>	 can't you comment out the bit of the script that creates the s3 DB?
[13:41:45] <Amir1>	 Reedy: data corruption on the main page
[13:42:19] <Reedy>	 Has the cause of the corruption been fixed?
[13:42:25] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:42:33] <Reedy>	 Krenair: Aaron has added "skipclusters" so we don't have to comment stuff out
[13:42:43] <icinga-wm>	 RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:42:53] <icinga-wm>	 RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[13:43:04] <Amir1>	 It gives out this
[13:43:05] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:43:08] <Amir1>	 https://www.irccloud.com/pastebin/FkL7Xptw/
[13:43:14] <Krenair>	 Reedy: Wow, living in the future.
[13:43:17] <icinga-wm>	 RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:43:18] <Reedy>	 ikr?
[13:43:23] <icinga-wm>	 RECOVERY - puppet last run on mw1344 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:43:27] <Amir1>	 exactly :D
[13:43:29] <wikibugs>	 (03PS21) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[13:43:30] <Krenair>	 Back in the day we didn't have these nice things
[13:43:37] <icinga-wm>	 RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:43:59] <icinga-wm>	 RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:44:00] <Krenair>	 As a deployer you would have to customise addWiki against whichever kind of breakage was live on that particular day
[13:44:11] <icinga-wm>	 RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[13:44:11] <icinga-wm>	 RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:44:11] <icinga-wm>	 RECOVERY - puppet last run on an-worker1094 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:44:16] <Amir1>	 Reedy: Maybe there is something wrong somewhere else. Do you want to double check the error?
[13:45:09] <icinga-wm>	 RECOVERY - puppet last run on sessionstore1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:45:37] <wikibugs>	 (03PS3) 10Herron: logstash: send varnish syslogs via kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899)
[13:45:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10Marostegui) Any update from HP?
[13:46:09] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.653 second response time https://phabricator.wikimedia.org/T174916
[13:46:28] <Reedy>	 Amir1: I've no idea what tt:1 is supposed to be
[13:46:31] <icinga-wm>	 RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:46:41] <icinga-wm>	 RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:46:43] <Reedy>	 But it kinda looks like it's having issues loading/saving the revisions from/to ES
[13:47:28] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet
[13:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:01] <icinga-wm>	 RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:49:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: 10Giuseppe Lavagetto)
[13:50:02] <hashar>	 !log Reverted CI Jenkins jobs to Quibble 0.0.28 # T219647
[13:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:06] <wikibugs>	 (03PS37) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[13:50:07] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[13:50:07] <stashbot>	 T219647: Upgrade CI jobs to Quibble 0.0.30 - https://phabricator.wikimedia.org/T219647
[13:52:03] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: add profile for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[13:56:17] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5001.eqsin.wmnet
[13:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:05] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5007.eqsin.wmnet
[13:57:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:38] <wikibugs>	 (03PS1) 10Elukey: aptrepo: update cloudera-jessie to 5.16.1 [puppet] - 10https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343)
[13:59:53] <Amir1>	 so the extstore is empty for both of them
[14:00:36] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Looks good!  One small/optional thing inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite)
[14:02:00] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Add rsyslog kafka to service nodes. [puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko)
[14:03:21] <wikibugs>	 (03PS1) 10Ladsgroup: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597)
[14:03:25] <wikibugs>	 (03CR) 10Ppchelko: "Oh, I've collected a +4 on this one..." [puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko)
[14:03:47] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup)
[14:04:10] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Expose rsyslog_udp_port to services configs. [puppet] - 10https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko)
[14:04:31] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5007.eqsin.wmnet
[14:04:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:25] <wikibugs>	 (03Merged) 10jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup)
[14:05:33] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:05:38] <wikibugs>	 (03CR) 10jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup)
[14:07:20] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 00m 52s)
[14:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:34] <wikibugs>	 (03PS6) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[14:08:55] <wikibugs>	 (03CR) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:09:18] <logmsgbot>	 !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided)
[14:09:43] <moritzm>	 !log uploaded debdeploy 0.0.99.10 to apt.wikimedia.org (jessie, stretch, buster)
[14:09:52] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway)
[14:11:08] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-2] "DO NOT DEPLOY. I'm in the middle of a deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway)
[14:11:19] <stashbot>	 ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[14:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:47] <wikibugs>	 (03CR) 10Mholloway: "Yikes, sorry about that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway)
[14:14:03] <wikibugs>	 (03CR) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[14:15:02] <wikibugs>	 (03PS1) 10Jbond: pdebuild: ensure proxy config exists for apt-get install [puppet] - 10https://gerrit.wikimedia.org/r/500458
[14:16:45] <Amir1>	 forgot to rebase, need to do wikiversions again
[14:17:10] <wikibugs>	 (03PS19) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[14:17:14] <wikibugs>	 (03PS13) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[14:17:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:17:36] <wikibugs>	 (03CR) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:18:07] <logmsgbot>	 !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided)
[14:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:30] <wikibugs>	 (03PS4) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437
[14:21:31] <wikibugs>	 (03PS4) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438
[14:21:33] <wikibugs>	 (03PS7) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[14:22:51] <wikibugs>	 (03PS16) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[14:22:53] <wikibugs>	 (03PS20) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[14:22:55] <wikibugs>	 (03PS14) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[14:26:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:28:54] <wikibugs>	 10Operations, 10User-fgiunchedi: Session storage Cassandra metrics (Prometheus) not being collected - https://phabricator.wikimedia.org/T219523 (10Eevans)
[14:28:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:29:01] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10Eevans)
[14:29:18] <Amir1>	 okay, it's only the main page that is grumpy :D
[14:29:19] <Amir1>	 https://hyw.wikipedia.org/wiki/Foo
[14:29:42] <wikibugs>	 10Operations, 10Cassandra, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans)
[14:29:50] <wikibugs>	 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10Eevans)
[14:30:59] <wikibugs>	 (03CR) 10Ema: [C: 03+1] cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:31:33] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437
[14:31:36] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438
[14:31:38] <wikibugs>	 (03PS8) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[14:31:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:32:05] <Amir1>	 !log wikiadmin@10.64.32.136(hywwiki)> update text set old_text = 'DB://cluster25/1';
[14:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:51] <wikibugs>	 (03PS1) 10Jbond: pdebuild: add a new repo for build dependencies [puppet] - 10https://gerrit.wikimedia.org/r/500464
[14:33:52] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T212597 (duration: 00m 51s)
[14:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:00] <stashbot>	 T212597: Create Wikipedia Western Armenian - https://phabricator.wikimedia.org/T212597
[14:34:02] <wikibugs>	 (03CR) 10Ema: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:35:17] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 50s)
[14:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:04] <Amir1>	 marostegui: There is no need to drop anything, I manually fixed it in db
[14:36:39] <Reedy>	 Amir1: Well, doesn't look completely fixed
[14:36:43] <Reedy>	 curprev 12:33, 27 March 2019‎ 127.0.0.1 talk‎  1,380 bytes +1,380‎  Created page with "==This subdomain is reserved for the creation of a Wikipedia in '''արեւմտահայերէն''' language==..."
[14:36:50] <Reedy>	 Content is ".
[14:36:50] <Reedy>	 "
[14:37:06] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized langlist: (no justification provided) (duration: 00m 50s)
[14:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:57] <Amir1>	 yeah, I basically changed the pointer to the content
[14:38:02] <Reedy>	 I know, I read it
[14:38:16] <Amir1>	 because the content doesn't exist 
[14:38:19] <wikibugs>	 (03PS17) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[14:38:21] <wikibugs>	 (03PS21) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[14:38:23] <wikibugs>	 (03PS15) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[14:38:25] <Amir1>	 How can I make it better?
[14:38:30] <wikibugs>	 (03CR) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:38:41] <Reedy>	 I think you delete the main page (or someone does), just to clear up the history
[14:38:48] <Reedy>	 Editing other pages seems fine
[14:38:57] <Reedy>	 So it's some bug/context thing in addWiki that's broken
[14:39:07] <Reedy>	 I've no idea where tt comes from
[14:39:31] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[14:39:44] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[14:39:53] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel)
[14:40:11] <wikibugs>	 (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:40:13] <Amir1>	 I also made a page called Foo, that needs to be deleted as well
[14:40:16] <Amir1>	 I'll handle it
[14:40:36] <marostegui>	 Amir1: yaaaay :)
[14:40:40] <marostegui>	 Amir1: thanks
[14:41:05] <marostegui>	 Amir1: let me know when I can also try to register my user so I can see if it gets correctly filtered before sending it for the views creation
[14:41:54] <Amir1>	 !log ladsgroup@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki=hywwiki --force --sysop Ladsgroup
[14:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:25] <moritzm>	 !log restarting superset on analytics-tool1004 to pick up latest Python
[14:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:32] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel)
[14:43:10] <wikibugs>	 (03PS1) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343)
[14:43:22] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel)
[14:43:22] <Amir1>	 How can I desysop myself now 🤔
[14:43:51] <Lucas_WMDE>	 can’t you un-sysop yourself on Special:UserRights?
[14:43:53] <Reedy>	 Removed
[14:44:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:44:43] <moritzm>	 !log rolling out debdeploy 0.0.99.10 for jessie, buster, stretch systems
[14:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:03] <wikibugs>	 (03CR) 10Ema: [C: 03+1] nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:45:12] <Amir1>	 nope :P
[14:45:20] <Amir1>	 Thanks Reedy 
[14:46:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:46:14] <wikibugs>	 (03PS9) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[14:46:39] <marostegui>	 Amir1: Just checked my user, and it gets filtered correctly
[14:47:04] <Amir1>	 marostegui: \o/
[14:47:42] <marostegui>	 Amir1: so from your side this is all done?
[14:47:49] <Amir1>	 yessssss
[14:48:00] <Amir1>	 I'm handing it over to the community
[14:48:01] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, optional nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey)
[14:48:13] <Amir1>	 (Connection to wikidata is needed) etc.
[14:48:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[14:48:36] <marostegui>	 Amir1: So can I send it to cloud for the views creation?
[14:48:45] <Amir1>	 marostegui: sure
[14:48:48] <wikibugs>	 (03PS2) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343)
[14:49:14] <wikibugs>	 (03PS3) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343)
[14:50:11] <wikibugs>	 (03CR) 10Ladsgroup: "Safe to go now, sorry for this. I was in the middle of a very complex deploy that could blow up in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway)
[14:50:30] <marostegui>	 Amir1: let me run some more checks on my side
[14:50:34] <marostegui>	 just to be 100% clear
[14:50:40] <Amir1>	 Sure
[14:50:55] <wikibugs>	 10Operations, 10Acme-chief, 10Traffic, 10Goal, 10Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez)
[14:51:02] <marostegui>	 as we had some reports today about it (that I commented on the ticket) - I want to see if those clear out
[14:51:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[14:52:11] <Amir1>	 marostegui: This https://phabricator.wikimedia.org/T212597#5065429 ?
[14:52:33] <marostegui>	 Amir1: No, this: https://phabricator.wikimedia.org/T212625#5072845
[14:52:43] <marostegui>	 So far looks solved, I am waiting for the check to fully finish
[14:52:48] <marostegui>	 To be 100% sure
[14:52:57] <Amir1>	 oh okay, sure
[14:53:17] <wikibugs>	 (CR) Mholloway: "Thanks, that's OK, and sorry for the scare!  I should have double-checked the calendar." [mediawiki-config] - https://gerrit.wikimedia.org/r/500049 (owner: Mholloway)
[14:53:29] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[14:54:18] <wikibugs>	 (CR) Mholloway: [C: +2] Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - https://gerrit.wikimedia.org/r/500049 (owner: Mholloway)
[14:54:40] <wikibugs>	 (PS2) Volans: wmf-auto-reimage: fix Icinga delayed downtime [puppet] - https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775)
[14:55:28] <wikibugs>	 (Merged) jenkins-bot: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - https://gerrit.wikimedia.org/r/500049 (owner: Mholloway)
[14:55:30] <wikibugs>	 (CR) Volans: "replies inline, agreed" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: Volans)
[14:55:50] <wikibugs>	 (CR) Volans: [C: +2] wmf-auto-reimage: fix Icinga delayed downtime [puppet] - https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: Volans)
[14:55:59] <wikibugs>	 (CR) Muehlenhoff: cumin: add aliases for Hadoop HDFS journalnodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[14:57:14] <wikibugs>	 Operations, Operations-Software-Development, Patch-For-Review: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (Volans) Open→Resolved
[14:57:27] <wikibugs>	 Operations, Continuous-Integration-Config, Patch-For-Review, User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (Krinkle) a:MoritzMuehlenhoff→Krinkle
[14:57:29] <wikibugs>	 (CR) Elukey: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[14:59:12] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs (duration: 00m 50s)
[14:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:42] <wikibugs>	 (CR) jenkins-bot: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - https://gerrit.wikimedia.org/r/500049 (owner: Mholloway)
[15:01:01] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: hieradata: openstack: codfw1dev: nova: add missing nova keys [puppet] - https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626)
[15:01:38] <Amir1>	 Search is broken on hywwiki
[15:01:40] <Amir1>	 yay
[15:02:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ah, that makes sense." [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[15:03:17] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:05:07] <wikibugs>	 (PS10) Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - https://gerrit.wikimedia.org/r/492375
[15:06:05] <wikibugs>	 (Abandoned) Andrew Bogott: nova: add wmcs-rescue-console.sh to compute hosts [puppet] - https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) (owner: Andrew Bogott)
[15:06:35] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: hieradata: openstack: codfw1dev: nova: add missing nova keys [puppet] - https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626)
[15:08:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/15467/" [puppet] - https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626) (owner: Arturo Borrero Gonzalez)
[15:08:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:09:06] <Amir1>	 !log mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=hywwiki --baseName hywwiki --cluster (eqiad|codfw)
[15:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:17] <wikibugs>	 (PS1) Marostegui: realm.pp: Add urlshortcodes to private table [puppet] - https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777)
[15:11:16] <wikibugs>	 (CR) Elukey: [C: +2] cumin: add aliases for Hadoop HDFS journalnodes [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[15:11:22] <wikibugs>	 (PS4) Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343)
[15:12:27] <wikibugs>	 (PS2) Marostegui: mariadb: Remove labsdb1004,labsdb1005 [puppet] - https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749)
[15:13:26] <wikibugs>	 (CR) Bstorm: [C: +1] mariadb: Remove labsdb1004,labsdb1005 [puppet] - https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749) (owner: Marostegui)
[15:13:49] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove labsdb1004,labsdb1005 [puppet] - https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749) (owner: Marostegui)
[15:14:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:14:03] <wikibugs>	 (PS7) Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899)
[15:14:57] <wikibugs>	 (CR) Marostegui: [C: +1] labsdb: remove old and likely unused cname for labsdb1004 [dns] - https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) (owner: Bstorm)
[15:15:16] <wikibugs>	 (PS2) Bstorm: labsdb: remove old and likely unused cname for labsdb1004 [dns] - https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749)
[15:15:55] <wikibugs>	 (CR) Bstorm: [C: +2] labsdb: remove old and likely unused cname for labsdb1004 [dns] - https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) (owner: Bstorm)
[15:16:58] <wikibugs>	 (PS5) Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343)
[15:17:03] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] cumin: add aliases for Hadoop HDFS journalnodes [puppet] - https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[15:18:01] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776)
[15:18:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[15:18:23] <wikibugs>	 Operations, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (Marostegui) >>! In T212625#5072845, @Marostegui wrote: > This wiki is triggering some false positives on our labs private data checking methods, even if it...
[15:18:55] <wikibugs>	 (CR) Herron: "discussed on irc a bit and following up here -- switched away from hiera regex in favor of a new hiera key called profile::ores::logstash_" [puppet] - https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: Herron)
[15:19:15] <wikibugs>	 (PS1) BBlack: Add wikiba.se to HTTPS/HSTS regexes for canonicals [puppet] - https://gerrit.wikimedia.org/r/500472 (https://phabricator.wikimedia.org/T213705)
[15:20:08] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.343 second response time https://phabricator.wikimedia.org/T174916
[15:20:16] <wikibugs>	 Operations, Traffic, monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (Volans) @ema  given the speedup due to prometheus 2 do you think this still needs to be worked on or could be resolved?
[15:20:36] <wikibugs>	 (PS38) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921)
[15:20:47] <wikibugs>	 Operations, Data-Services, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (Bstorm)
[15:22:01] <wikibugs>	 (PS5) Giuseppe Lavagetto: Add rsyslog kafka to service nodes. [puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko)
[15:22:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Log in to the registry if credentials are provided [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: Giuseppe Lavagetto)
[15:22:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add rsyslog kafka to service nodes. [puppet] - https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko)
[15:23:16] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[15:23:20] <wikibugs>	 Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (Bstorm) a:Bstorm→RobH
[15:23:26] <_joe_>	 Pchelolo: ^^ now merging
[15:23:34] <Pchelolo>	 awesome, thank you!
[15:23:39] <wikibugs>	 (Merged) jenkins-bot: Log in to the registry if credentials are provided [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: Giuseppe Lavagetto)
[15:23:45] <wikibugs>	 Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (Bstorm)
[15:24:03] <wikibugs>	 (PS11) Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - https://gerrit.wikimedia.org/r/492375
[15:24:05] <wikibugs>	 (CR) jenkins-bot: Log in to the registry if credentials are provided [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: Giuseppe Lavagetto)
[15:24:12] <wikibugs>	 (Abandoned) Gehel: elasticsearch: add method to mock node info API [software/spicerack] - https://gerrit.wikimedia.org/r/492385 (owner: Gehel)
[15:24:20] <Pchelolo>	 _joe_: there's also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/498872/ and I can finally switch things over
[15:24:28] <wikibugs>	 Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (Bstorm) @Marostegui it was supposed to be down.  It needed a kill -9.
[15:24:36] <_joe_>	 I know
[15:24:36] <wikibugs>	 Operations, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (Marostegui) #cloud-services-team this is ready for the views creation. I have added the usual GRANT so the views can be created for this wiki:   Please rem...
[15:24:41] <_joe_>	 one at a time please :)
[15:24:43] <vgutierrez>	 !log disable puppet in the cache text cluster - T213705
[15:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:47] <stashbot>	 T213705: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705
[15:25:15] <wikibugs>	 Operations, Data-Services, decommission, cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (Marostegui) Thanks :-) So fully ready for @RobH to take over
[15:25:31] <wikibugs>	 (CR) Mathew.onipe: "PCC output is OK: https://puppet-compiler.wmflabs.org/compiler1002/15471/" [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[15:25:50] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[15:25:58] <wikibugs>	 Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (bd808)
[15:26:02] <wikibugs>	 (PS16) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[15:26:06] <_joe_>	 Pchelolo: there is some issue with the patch, damn
[15:26:27] <wikibugs>	 (PS2) BBlack: Add wikiba.se to HSTS regex [puppet] - https://gerrit.wikimedia.org/r/500472 (https://phabricator.wikimedia.org/T213705)
[15:26:29] <wikibugs>	 (PS1) BBlack: Add wikiba.se to HTTPS redirect regex [puppet] - https://gerrit.wikimedia.org/r/500473 (https://phabricator.wikimedia.org/T213705)
[15:26:44] <Pchelolo>	 how can you see that?
[15:26:45] <wikibugs>	 Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (Marostegui) a:Marostegui→None
[15:27:36] <_joe_>	 Pchelolo: by running puppet on one host
[15:27:48] <icinga-wm>	 PROBLEM - DPKG on restbase2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:27:52] <_joe_>	 Pchelolo: this is also why I wanted people who work on this to merge your change
[15:27:56] <_joe_>	 Pchelolo: this ^^
[15:27:57] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776)
[15:28:02] <_joe_>	 and now I have a meeting
[15:28:18] <wikibugs>	 (CR) BryanDavis: "I'm working on the big picture fix for this in Iefcc0a8ea51a3cddc0e79218809e14d97acfc186, but removing this check now is ok with me." [puppet] - https://gerrit.wikimedia.org/r/500409 (owner: Muehlenhoff)
[15:28:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[15:28:29] <Pchelolo>	 oh. damn..
[15:28:43] <_joe_>	 Pchelolo: and for some reason the second attempt at installing the update to rsyslog works
[15:28:50] <wikibugs>	 (PS1) Gehel: elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - https://gerrit.wikimedia.org/r/500474
[15:28:51] <_joe_>	 which makes me think this is a puppet ordering issue
[15:28:56] <icinga-wm>	 RECOVERY - DPKG on restbase2010 is OK: All packages OK
[15:29:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - https://gerrit.wikimedia.org/r/492375 (owner: Gehel)
[15:30:31] <wikibugs>	 (PS12) Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - https://gerrit.wikimedia.org/r/492375
[15:30:55] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5007.eqsin.wmnet
[15:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:15] <wikibugs>	 Operations, Analytics, EventBus, vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Milimetric) p:Triage→High
[15:40:18] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[15:40:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/500458 (owner: Jbond)
[15:42:07] <wikibugs>	 Operations, Analytics, Discovery, Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Milimetric) p:Triage→High
[15:42:08] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5007.eqsin.wmnet
[15:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:24] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[15:42:26] <icinga-wm>	 PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 seconds ago with 2 failures. Failed resources (up to 3 shown)
[15:43:05] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4032.ulsfo.wmnet
[15:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:08] <wikibugs>	 Operations, Analytics, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (Milimetric) p:Triage→Normal
[15:44:04] <wikibugs>	 (CR) Jbond: [C: +2] pdebuild: ensure proxy config exists for apt-get install [puppet] - https://gerrit.wikimedia.org/r/500458 (owner: Jbond)
[15:44:14] <wikibugs>	 (PS2) Jbond: pdebuild: ensure proxy config exists for apt-get install [puppet] - https://gerrit.wikimedia.org/r/500458
[15:44:34] <icinga-wm>	 PROBLEM - DPKG on scb2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:45:50] <icinga-wm>	 RECOVERY - DPKG on scb2002 is OK: All packages OK
[15:46:06] <icinga-wm>	 PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:48:11] <wikibugs>	 Operations, Discovery-Search (Current work), Wikimedia-Incident: Create cookbook to reset readonly indices on elasticsearch clusters - https://phabricator.wikimedia.org/T219799 (Gehel)
[15:48:51] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet
[15:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:36] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3042.esams.wmnet
[15:49:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:41] <wikibugs>	 (CR) Muehlenhoff: pdebuild: add a new repo for build dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500464 (owner: Jbond)
[15:49:50] <icinga-wm>	 PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 51 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:49:58] <icinga-wm>	 PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 56 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:50:25] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (Milimetric) p:Normal→High
[15:51:06] <icinga-wm>	 RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:51:08] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776)
[15:51:57] <wikibugs>	 (CR) Mathew.onipe: [C: +1] elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - https://gerrit.wikimedia.org/r/500474 (owner: Gehel)
[15:52:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[15:52:36] <icinga-wm>	 PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:52:40] <icinga-wm>	 PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:54:52] <icinga-wm>	 RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[15:55:00] <icinga-wm>	 RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:55:18] <wikibugs>	 Operations, Puppet, Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) p:Triage→Normal
[15:56:25] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3042.esams.wmnet
[15:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:51] <wikibugs>	 Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (greg) Just drive-by checking in on UBN!s: is this task still UBN!?
[15:57:26] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:57:39] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2023.codfw.wmnet
[15:57:40] <icinga-wm>	 RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[15:57:40] <icinga-wm>	 RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:57:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:44] <icinga-wm>	 RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:59:48] <wikibugs>	 Operations, Puppet, Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) ## notes building facter3 for debian:   facter3 has a dependency on debhelper 11, however for now i am testing with debhelper v10 by updating the package locally   pbuilder-sa...
[16:00:04] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:33] <_joe_>	 jbond42: if you're upgrading puppet to puppet 5, we might want to talk about fixing the hiera mess
[16:00:45] <_joe_>	 it will help when we need to rewrite the hiera backends having just one left
[16:00:47] <_joe_>	 :)
[16:01:06] <icinga-wm>	 PROBLEM - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 35357: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:01:06] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[16:01:09] <wikibugs>	 Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (ema)
[16:01:20] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[16:01:25] <wikibugs>	 Operations, Traffic, monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (ema) Open→Resolved a:ema >>! In T190992#5074344, @Volans wrote: > @ema  given the speedup due to prometheus 2 do you think this still needs t...
[16:01:34] <wikibugs>	 Operations, Traffic, monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (ema) Open→Resolved a:ema
[16:01:41] <wikibugs>	 Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (ema)
[16:02:32] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:02:54] <icinga-wm>	 PROBLEM - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 5000: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:04:44] <icinga-wm>	 PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[rabbit_nova_create],Package[keystone]
[16:05:27] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2023.codfw.wmnet
[16:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:07] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet
[16:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:36] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:11:10] <wikibugs>	 (CR) Jbond: pdebuild: add a new repo for build dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500464 (owner: Jbond)
[16:12:54] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:01] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet
[16:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:08] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[16:15:45] <arturo>	 !log T219776 reimaging + renaming labtestnet2003 into cloudnet2003-dev
[16:15:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:48] <stashbot>	 T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776
[16:18:49] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: labtestnet2003: cleanup old FQDNs [dns] - https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776)
[16:20:36] <wikibugs>	 (CR) Gehel: [C: +2] elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - https://gerrit.wikimedia.org/r/500474 (owner: Gehel)
[16:25:12] <bblack>	 !log uploading gdnsd-3.1.0-1~wmf1 to stretch-wikimedia
[16:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:22] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.414 second response time https://phabricator.wikimedia.org/T174916
[16:26:50] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:28:33] <bblack>	 !log upgrade gdnsd -> 3.1.0 on cp1099 (authdns test)
[16:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:47] <wikibugs>	 (PS37) DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[16:28:48] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[16:30:08] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:30:47] <wikibugs>	 (PS1) Jgreen: flip payments.wm.o back to eqiad cluster [dns] - https://gerrit.wikimedia.org/r/500487
[16:32:37] <vgutierrez>	 !log slowly reenabling puppet in cache text cluster - T213705
[16:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:40] <stashbot>	 T213705: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705
[16:38:07] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM, trivial enough" [software/spicerack] - https://gerrit.wikimedia.org/r/500440 (owner: Volans)
[16:43:30] <wikibugs>	 Operations, ops-eqiad, DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) Thanks for the update @Cmjohnson! Are the HP hosts that can have the BBU changed with no disruption or should we plan a failover for this host? Thanks!
[16:43:57] <wikibugs>	 (CR) Jgreen: [C: +2] flip payments.wm.o back to eqiad cluster [dns] - https://gerrit.wikimedia.org/r/500487 (owner: Jgreen)
[16:48:12] <wikibugs>	 (PS3) Santhosh: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948)
[16:50:44] <wikibugs>	 Operations, Acme-chief, Traffic, Goal, Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez)
[16:51:05] <wikibugs>	 Operations, Core Platform Team, MediaWiki-General-or-Unknown, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe)
[16:51:08] <wikibugs>	 (PS3) Mholloway: Add cron job to update WikimediaEditorTasks suggestions table [puppet] - https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136)
[16:52:39] <wikibugs>	 Operations, ops-eqiad, Analytics, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (elukey) If you don't find anybody online to shutdown stat1005 please go ahead and do it, we are not running anything on it!
[16:57:36] <wikibugs>	 (PS3) EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:58:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[16:59:14] <wikibugs>	 (PS4) EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:59:42] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:00:04] <jouncebot>	 gehel and onimisionipe: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1700).
[17:00:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Disable wbcs dispatching query builder on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson)
[17:00:15] <onimisionipe>	 jouncebot: here here
[17:02:37] <wikibugs>	 (03CR) 10Volans: [C: 04-1] puppet compiler: collect facts from cloud VMs as well as prod hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott)
[17:03:13] <wikibugs>	 (03PS5) 10EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[17:03:35] <logmsgbot>	 !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@115a6bf]: Added more endpoint, GUI updates and new bot pattern
[17:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) I've just rechecked, and the following hosts are either empty or only running canary instances:  labvirt1008 cloudvirt1009 cloudvirt101...
[17:06:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Making this a -1 but I indicated in the comments what needs to be done before this patch can be merged." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott)
[17:06:36] <icinga-wm>	 PROBLEM - Host cloudcontrol2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:22] <icinga-wm>	 RECOVERY - Host cloudcontrol2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[17:07:59] <arturo>	 !log restart dhcp server in install2002 to release old lease for labtestnet2003
[17:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:11] <wikibugs>	 (03PS13) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[17:10:22] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:10:34] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:56] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[17:12:16] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[17:14:32] <icinga-wm>	 PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Exec[rabbit_nova_create],Package[keystone]
[17:15:31] <wikibugs>	 (03PS2) 10Volans: tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440
[17:15:45] <logmsgbot>	 !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@115a6bf]: Added more endpoint, GUI updates and new bot pattern (duration: 12m 10s)
[17:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:51] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) Dell sent the correct size disk, thanks to @robh.   Raid is rebuilding cmjohnson@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Rebuild Firmwa...
[17:18:20] <wikibugs>	 (03CR) 10ArielGlenn: "not quite rebased but close, it will need some inspection and testing before merge, but since the shortUrl project seems to have new life," [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn)
[17:21:15] <bblack>	 !log uploading gdnsd-3.1.0-1~wmf2 to stretch-wikimedia
[17:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:14] <bblack>	 !log upgrade gdnsd -> 3.1.0 (wmf2) on cp1099 (authdns test)
[17:22:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:49] <wikibugs>	 (03CR) 10Andrew Bogott: "new version with --wmcs switch coming up" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott)
[17:23:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2003: cleanup old FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[17:24:05] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: labtestnet2003: cleanup old FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776)
[17:24:49] <hauskatze>	 Reedy: can we regenerate the interwiki cache for `hyw:` to work now that the wiki is created?
[17:25:33] <icinga-wm>	 PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:26:25] <icinga-wm>	 PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:27:31] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:31:09] <bblack>	 !log authdns1001 (ns0) upgrade gdnsd -> 3.1.0
[17:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:31] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero)
[17:32:31] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[17:32:38] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[17:33:25] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:33:29] <wikibugs>	 (03PS39) 10Gehel: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[17:35:31] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[17:40:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:41:35] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmnet: fix typo in cloudnet2003-dev FQDN [dns] - 10https://gerrit.wikimedia.org/r/500500 (https://phabricator.wikimedia.org/T219776)
[17:41:51] <icinga-wm>	 RECOVERY - Host stat1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms
[17:42:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmnet: fix typo in cloudnet2003-dev FQDN [dns] - 10https://gerrit.wikimedia.org/r/500500 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[17:42:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) >>! In T216195#5075091, @Andrew wrote: > I've just rechecked, and the following hosts are either empty or only running canary instances:...
[17:43:34] <wikibugs>	 (03PS18) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[17:43:50] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] `  Of which those...
[17:44:09] <wikibugs>	 (03Abandoned) 10Andrew Bogott: puppet compiler: add more puppet masters to the fact-collection stage [puppet] - 10https://gerrit.wikimedia.org/r/499584 (owner: 10Andrew Bogott)
[17:44:13] <wikibugs>	 (03CR) 10Volans: [C: 03+2] tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440 (owner: 10Volans)
[17:44:35] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:37] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:39] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1002 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:43] <wikibugs>	 (03PS4) 10Andrew Bogott: puppet-compiler: restore the ability to export facts without puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430)
[17:44:43] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:45] <wikibugs>	 (03PS1) 10Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430)
[17:44:47] <gehel>	 ^ new check failing
[17:44:47] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:01] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[17:45:05] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:45:05] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:05] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:45:21] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:23] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:29] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:33] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:42] <wikibugs>	 (03Abandoned) 10Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott)
[17:46:21] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2002 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:47:01] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:48:05] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[17:48:21] <wikibugs>	 (03Merged) 10jenkins-bot: tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440 (owner: 10Volans)
[17:48:44] <wikibugs>	 (03PS2) 10Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430)
[17:49:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] realm.pp: Add urlshortcodes to private table [puppet] - 10https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: 10Marostegui)
[17:50:42] <wikibugs>	 (03CR) 10jenkins-bot: tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440 (owner: 10Volans)
[17:54:29] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:55:39] <XioNoX>	 !log remove asw2-c-eqiad:et-3/1/2 from disabled interfaces - T218059
[17:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:43] <stashbot>	 T218059: asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1 flapping - https://phabricator.wikimedia.org/T218059
[17:57:15] <icinga-wm>	 RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:58:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:58:11] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] `  Of which those...
[18:00:05] <jouncebot>	 Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1800)
[18:00:05] <jouncebot>	 dcausse, Lucas_WMDE, and kart_: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:25] <Lucas_WMDE>	 o/
[18:00:26] <dcausse>	 o/
[18:00:31] <dcausse>	 Lucas_WMDE: please go ahead
[18:00:39] <Lucas_WMDE>	 ok
[18:00:49] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:01:37] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (10Cmjohnson)
[18:01:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:01:43] <Lucas_WMDE>	 my / Amir1’s config change should have no effect in production
[18:01:46] <Lucas_WMDE>	 only on beta
[18:01:51] <Lucas_WMDE>	 it’ll be deployed there automatically, right?
[18:02:02] <Lucas_WMDE>	 (and I’ll still do the scap + mwdebug dance to make sure it doesn’t break anything, of course)
[18:02:27] <wikibugs>	 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10Cmjohnson)
[18:02:39] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:39] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash2004 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:45] <kart_>	 I'm around.
[18:03:02] <onimisionipe>	 more silencing
[18:03:23] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:04:10] <kart_>	 Lucas_WMDE: can you deploy my patch too? :)
[18:04:21] <Lucas_WMDE>	 probably, yeah
[18:04:25] <Lucas_WMDE>	 does it have a +1 already? :)
[18:04:32] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "sorry, forgot the rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:04:37] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:04:40] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (10Cmjohnson) 05Open→03Resolved card has been swapped
[18:04:47] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.799 second response time https://phabricator.wikimedia.org/T174916
[18:05:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) Per Chris's request I've gone ahead and put the following servers into maint until Friday in icinga:  labvirt1008 cloudvirt1009 c...
[18:05:50] <wikibugs>	 (03Merged) 10jenkins-bot: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:06:20] <kart_>	 Lucas_WMDE: Let me do that. It had, lost in rebase.
[18:06:45] <kart_>	 Lucas_WMDE: OK. It has :)
[18:07:32] <Lucas_WMDE>	 ok good :)
[18:07:36] <Lucas_WMDE>	 still busy with my own change atm
[18:08:12] <Lucas_WMDE>	 doesn’t seem to have broken anything on mwdebug1002, syncing
[18:08:43] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[18:09:35] <wikibugs>	 (03CR) 10Volans: "Looks mostly ok, couple of nitpicks inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[18:10:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499999|Add tmpSerializeEmptyListsAsObjects Wikibase repo config (T138104)]] (duration: 00m 54s)
[18:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:05] <stashbot>	 T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104
[18:12:05] <Lucas_WMDE>	 I’m done for the moment, dcausse / kart_ who wants to go next?
[18:12:16] * Lucas_WMDE looks at kart_’s Gerrit change
[18:12:40] <dcausse>	 sorry I'm busy atm so if anyone wants to go ahead please do so
[18:12:47] <Lucas_WMDE>	 alright then I’ll continue
[18:14:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: 10Santhosh)
[18:15:04] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: 10Santhosh)
[18:15:11] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: 10Santhosh)
[18:15:22] <wikibugs>	 10Operations, 10ops-eqiad: asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1 flapping - https://phabricator.wikimedia.org/T218059 (10ayounsi) 05Open→03Resolved There was a VC link between FPC3 and FPC8 that was acting up/flapping long time ago.  As we already had quite too many VC links, I disabled as well but left...
[18:15:37] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[18:17:05] <wikibugs>	 (03Merged) 10jenkins-bot: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: 10Santhosh)
[18:17:45] <Lucas_WMDE>	 kart_: your change is on mwdebug1002, can you test it?
[18:18:10] <kart_>	 Lucas_WMDE: Thanks. Nothing to test, but let me check if nothing is broken.
[18:18:17] <Lucas_WMDE>	 ok
[18:18:27] <bblack>	 !log multatuli (ns2) upgrade gdnsd -> 3.1.0
[18:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:01] <wikibugs>	 (03CR) 10jenkins-bot: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup)
[18:20:05] <wikibugs>	 (03CR) 10jenkins-bot: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: 10Santhosh)
[18:21:11] <kart_>	 Lucas_WMDE: go ahead.
[18:21:17] <Lucas_WMDE>	 alright
[18:22:56] <Lucas_WMDE>	 apparently there was only one occurrence of this error in the last 24h
[18:22:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:498913|ExternalGuidance: Allow google translate hosts as known services (T218948)]] (duration: 00m 53s)
[18:23:05] <Lucas_WMDE>	 might be a while until we can definitely say it’s fixed, I guess
[18:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:09] <stashbot>	 T218948: InvalidArgumentException from SpecialExternalGuidance: Invalid service name - https://phabricator.wikimedia.org/T218948
[18:23:18] <Lucas_WMDE>	 but anyways, done
[18:24:57] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200- on logstash1007 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:25:08] <dcausse>	 Lucas_WMDE: can I go ahead?
[18:25:30] <kart_>	 Lucas_WMDE: Thanks a lot!
[18:25:38] <kart_>	 Lucas_WMDE: Yes. I'll keep watching.
[18:25:54] <Lucas_WMDE>	 dcausse: yeah, sure
[18:25:57] <Lucas_WMDE>	 sorry for the missing ping
[18:26:00] <dcausse>	 thanks!
[18:26:04] <dcausse>	 np
[18:26:45] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[18:26:48] <wikibugs>	 (03PS1) 10Gehel: Revert "elasticsearch: add profile for icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/500514
[18:28:07] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Revert "elasticsearch: add profile for icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/500514 (owner: 10Gehel)
[18:28:23] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse)
[18:32:48] <wikibugs>	 10Operations, 10puppet-compiler: Puppet compiler returns errors - https://phabricator.wikimedia.org/T219742 (10Smalyshev) 05Open→03Resolved a:03Smalyshev
[18:33:41] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Cleanup transitional states (duration: 00m 53s)
[18:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:44] <stashbot>	 T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381
[18:34:48] <wikibugs>	 (03PS4) 10DCausse: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268)
[18:36:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] realm.pp: Add urlshortcodes to private table [puppet] - 10https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: 10Marostegui)
[18:36:08] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 0, cluster_name: production-logstash-eqiad, status: green, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, active_shards: 202, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, number_of_data_nodes: 3, number_of_pending_tasks: 0, number_of_in_fligh
[18:36:08] <icinga-wm>	 r_of_nodes: 6, unassigned_shards: 0, timed_out: False, active_primary_shards: 86 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:12] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: active_primary_shards: 83, status: green, initializing_shards: 0, active_shards: 104, number_of_nodes: 2, number_of_data_nodes: 2, number_of_pending_tasks: 0, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, active_shards_percent_as_number: 
[18:36:12] <icinga-wm>	 _shards: 0, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:12] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: number_of_pending_tasks: 0, active_shards: 156, number_of_nodes: 6, number_of_data_nodes: 3, relocating_shards: 0, timed_out: False, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-codfw, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, initializing_shards
[18:36:12] <icinga-wm>	 ry_shards: 63, delayed_unassigned_shards: 0, unassigned_shards: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:16] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 6, unassigned_shards: 0, active_primary_shards: 86, number_of_in_flight_fetch: 0, active_shards: 202, number_of_pending_tasks: 0, status: green, timed_out: False, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-eqiad, task_max_waiting_in_queue_millis: 0, de
[18:36:16] <icinga-wm>	 shards: 0, number_of_data_nodes: 3, relocating_shards: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:17] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: 10DCausse)
[18:36:20] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: number_of_data_nodes: 2, active_shards: 12, status: green, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, timed_out: False, unassigned_shards: 0, initializing_shards: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_numb
[18:36:20] <icinga-wm>	 r_name: relforge-eqiad-small-alpha, active_primary_shards: 6, number_of_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:24] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: cluster_name: production-logstash-eqiad, status: green, active_shards: 202, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0, unassigned_shards: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, relocating_shards: 0, active_primary_shards: 86, numb
[18:36:24] <icinga-wm>	 ks: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 6, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:24] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: active_primary_shards: 63, number_of_nodes: 6, number_of_pending_tasks: 0, timed_out: False, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 156, delayed_unassigned_shards: 0, cluster_name: production-logstash-codfw, number_of_data_nodes: 3, initializing_shards: 0, relocati
[18:36:24] <icinga-wm>	 ive_shards_percent_as_number: 100.0, status: green, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:30] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, active_shards: 104, active_primary_shards: 83, number_of_data_nodes: 2, relocating_shards: 0, initializing_shards: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number
[18:36:30] <icinga-wm>	 ster_name: relforge-eqiad, timed_out: False, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:54] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, status: green, active_shards: 202, unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 100.0, number_of_nodes: 6, relocating_shards: 0, timed_out: False, active_primar
[18:36:54] <icinga-wm>	 ber_of_data_nodes: 3, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:54] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2004 is OK: OK - elasticsearch status production-logstash-codfw: number_of_in_flight_fetch: 0, status: green, active_shards: 156, initializing_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 0, unassigned_shards: 0, active_primary_shards: 63, timed_out: False, cluster_name: produc
[18:36:54] <icinga-wm>	 fw, delayed_unassigned_shards: 0, number_of_nodes: 6, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:36:54] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: relocating_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_data_nodes: 3, number_of_in_flight_fetch: 0, active_shards: 156, delayed_unassigned_shards: 0, status: green, number_of_pending_tasks: 0, cluster_name: production-logstash-codfw, initializing_shards: 0, tas
[18:36:55] <icinga-wm>	 queue_millis: 0, timed_out: False, number_of_nodes: 6, active_primary_shards: 63 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:37:14] <wikibugs>	 (Merged) jenkins-bot: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: DCausse)
[18:37:50] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: cluster_name: production-logstash-eqiad, status: green, timed_out: False, number_of_nodes: 6, active_shards: 202, initializing_shards: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 86, unassi
[18:37:50] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_data_nodes: 3, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:37:55] <wikibugs>	 (CR) jenkins-bot: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[18:37:57] <wikibugs>	 (CR) jenkins-bot: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: DCausse)
[18:38:42] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 0, initializing_shards: 0, number_of_nodes: 6, timed_out: False, number_of_data_nodes: 3, number_of_in_flight_fetch: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstas
[18:38:42] <icinga-wm>	 f_pending_tasks: 0, active_primary_shards: 86, active_shards: 202, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:38:42] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: active_shards: 156, initializing_shards: 0, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, timed_out: False, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, active_primary_shards: 63, status: green, relocating_shards: 0, number_of_data_nodes: 3, number_of_in_flight_fe
[18:38:42] <icinga-wm>	 ame: production-logstash-codfw, unassigned_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:38:42] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash2002 is OK: OK - elasticsearch status production-logstash-codfw: active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0, relocating_shards: 0, delayed_unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-codfw, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, status: green, number_of_nodes: 6, active_shards
[18:38:42] <icinga-wm>	 data_nodes: 3, timed_out: False, unassigned_shards: 0, active_primary_shards: 63 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:40:52] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.929 second response time https://phabricator.wikimedia.org/T174916
[18:42:35] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T219268: [cirrus] Use bm25 similarity for all wikis (duration: 00m 51s)
[18:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:38] <stashbot>	 T219268: Elasticsearch 6: the classic similarity is deprecated - https://phabricator.wikimedia.org/T219268
[18:43:55] <wikibugs>	 Operations, monitoring, Goal, User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (colewhite)
[18:43:58] <wikibugs>	 Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (colewhite) Open→Resolved
[18:44:06] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[18:44:21] <wikibugs>	 Operations, monitoring, Goal, User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (colewhite)
[18:44:52] <dcausse>	 !log Morning SWAT done
[18:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:40] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.296 second response time https://phabricator.wikimedia.org/T174916
[18:52:09] <wikibugs>	 (PS1) Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500521 (https://phabricator.wikimedia.org/T219817)
[18:52:16] <shdubsh>	 !log restart mjolnir-kafka-msearch on relforge1002 to adopt new logging config
[18:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:10] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[18:54:22] <wikibugs>	 (CR) Bstorm: [C: +2] sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500521 (https://phabricator.wikimedia.org/T219817) (owner: Bstorm)
[18:57:51] <wikibugs>	 (PS1) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921)
[18:58:51] <XioNoX>	 !log re-set ulsfo-codfw ospf cost to previous default - T219591
[18:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:00] <stashbot>	 T219591: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591
[19:01:15] <wikibugs>	 (PS2) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921)
[19:01:45] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (ayounsi) Open→Resolved Link has been up for 1+ day. Got a notification saying the emergency maintenance was done.
[19:07:30] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:09:39] <wikibugs>	 (PS1) Bstorm: Revert "sonofgridengine: make tools-checker hosts submit hosts" [puppet] - https://gerrit.wikimedia.org/r/500531
[19:11:04] <wikibugs>	 (CR) Bstorm: [C: +2] Revert "sonofgridengine: make tools-checker hosts submit hosts" [puppet] - https://gerrit.wikimedia.org/r/500531 (owner: Bstorm)
[19:23:40] <wikibugs>	 (PS1) Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817)
[19:32:38] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:32:58] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:34:56] <wikibugs>	 (PS1) Gilles: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499)
[19:36:35] <wikibugs>	 (CR) Gilles: [C: +2] Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[19:37:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:37:45] <wikibugs>	 (Merged) jenkins-bot: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[19:38:04] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:41:10] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:43:29] <wikibugs>	 (CR) jenkins-bot: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles)
[19:48:02] <bblack>	 !log authdns2001 (ns1) upgrade gdnsd -> 3.1.0
[19:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:56] <logmsgbot>	 !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216499 Renew Priority Hints origin trial token (duration: 00m 54s)
[19:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:59] <stashbot>	 T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499
[19:54:21] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:00:05] <jouncebot>	 cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2000).
[20:07:04] <icinga-wm>	 PROBLEM - puppet last run on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:04] <icinga-wm>	 PROBLEM - Disk space on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:08] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:12] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:20] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:22] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:26] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:36] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:44] <icinga-wm>	 PROBLEM - DPKG on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:46] <icinga-wm>	 PROBLEM - MD RAID on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:07:56] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:56] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:07:58] <icinga-wm>	 PROBLEM - dhclient process on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:08:16] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:08:50] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:09:14] <icinga-wm>	 PROBLEM - swift-account-server on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:09:16] <icinga-wm>	 PROBLEM - swift-container-server on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:09:48] <icinga-wm>	 PROBLEM - configured eth on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:09:56] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:10:02] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:10:22] <icinga-wm>	 PROBLEM - MD RAID on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer
[20:10:32] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[20:11:12] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift
[20:11:14] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2026 is OK: OK ferm input default policy is set
[20:11:18] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[20:11:22] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift
[20:11:30] <icinga-wm>	 RECOVERY - DPKG on ms-be2026 is OK: All packages OK
[20:11:40] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[20:11:40] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift
[20:11:40] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift
[20:11:40] <icinga-wm>	 RECOVERY - swift-account-server on ms-be2026 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift
[20:11:40] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2026 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift
[20:11:42] <icinga-wm>	 RECOVERY - dhclient process on ms-be2026 is OK: PROCS OK: 0 processes with command name dhclient
[20:12:00] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift
[20:12:14] <icinga-wm>	 RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[20:12:16] <icinga-wm>	 RECOVERY - configured eth on ms-be2026 is OK: OK - interfaces up
[20:12:16] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift
[20:13:10] <icinga-wm>	 PROBLEM - DNS labtestnet2003.mgmt on labtestnet2003.mgmt is CRITICAL: Domain labtestnet2003.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:16:22] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2026 is OK: OK - load average: 20.88, 72.79, 71.45 https://wikitech.wikimedia.org/wiki/Swift
[20:18:48] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:00:04] <jouncebot>	 bawolff and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2100). Please do the needful.
[21:00:41] <bawolff>	 \o/
[21:02:36] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:05:32] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:10:59] <dcausse>	 !log elasticsearch search cluster: reindex spaceless languages (T219533)
[21:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:02] <stashbot>	 T219533: Reindex space less languages wikis to use BM25 - https://phabricator.wikimedia.org/T219533
[21:11:38] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:12:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:16:01] <bawolff>	 We're going to deploy a security thing
[21:18:24] <icinga-wm>	 PROBLEM - dhclient process on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:18:28] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:18:34] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:18:40] <icinga-wm>	 PROBLEM - Check size of conntrack table on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:18:44] <icinga-wm>	 PROBLEM - swift-account-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:18:44] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:06] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:12] <icinga-wm>	 PROBLEM - MD RAID on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:19:14] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:16] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:18] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:19:24] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:26] <icinga-wm>	 PROBLEM - Disk space on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:19:28] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:19:32] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:19:58] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:20:04] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:20:08] <icinga-wm>	 PROBLEM - DPKG on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:20:08] <icinga-wm>	 PROBLEM - swift-container-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:21:02] <icinga-wm>	 PROBLEM - dhclient process on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:21:04] <icinga-wm>	 PROBLEM - swift-object-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:21:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:21:14] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[21:22:04] <icinga-wm>	 PROBLEM - Disk space on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:22:06] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[21:22:06] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:22:22] <icinga-wm>	 PROBLEM - configured eth on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer
[21:22:24] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift
[21:22:24] <icinga-wm>	 RECOVERY - Check size of conntrack table on ms-be2018 is OK: OK: nf_conntrack is 6 % full
[21:22:26] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift
[21:22:30] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift
[21:22:30] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift
[21:22:30] <icinga-wm>	 RECOVERY - swift-account-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift
[21:22:36] <icinga-wm>	 RECOVERY - DPKG on ms-be2018 is OK: All packages OK
[21:22:36] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift
[21:22:52] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 40.35, 36.11, 29.34 https://wikitech.wikimedia.org/wiki/Swift
[21:22:58] <icinga-wm>	 RECOVERY - MD RAID on ms-be2018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[21:23:02] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift
[21:23:02] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2018 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[21:23:06] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2018 is OK: OK ferm input default policy is set
[21:23:12] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift
[21:23:14] <icinga-wm>	 RECOVERY - Disk space on ms-be2018 is OK: DISK OK
[21:23:16] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift
[21:23:20] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[21:23:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational
[21:23:30] <icinga-wm>	 RECOVERY - dhclient process on ms-be2018 is OK: PROCS OK: 0 processes with command name dhclient
[21:23:33] <icinga-wm>	 RECOVERY - configured eth on ms-be2018 is OK: OK - interfaces up
[21:23:33] <icinga-wm>	 RECOVERY - swift-object-server on ms-be2018 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift
[21:23:33] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[21:27:32] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:28:06] <wikibugs>	 Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr1-eqsin [edit protocols bgp group Transit4] -    import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; +    import [ BGP_sanitize_in BGP_tran...
[21:28:40] <XioNoX>	 !log Push AS specific policy-statements to cr1/2-eqsin v4 peers - T211930
[21:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:44] <stashbot>	 T211930: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930
[21:33:28] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:35:40] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.105 second response time https://phabricator.wikimedia.org/T174916
[21:37:20] <wikibugs>	 (PS1) Alex Monk: service::node: Only try to define node10 repository if it is not already defined [puppet] - https://gerrit.wikimedia.org/r/500615
[21:39:40] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[21:44:18] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:44:26] <logmsgbot>	 !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Remove kowiki spam mitigations T212679 (duration: 00m 54s)
[21:44:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:37] <wikibugs>	 Operations, Analytics-Kanban, SRE-Access-Requests, Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (Tbayer) @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a bit optimistic (see again our internal tim...
[21:52:39] <wikibugs>	 (CR) MSantos: "Problem found in the Beta Cluster at deployment-maps04 https://horizon.wikimedia.org/project/instances/e469cff8-0791-4a83-aa86-3ba9bf3780d" [puppet] - https://gerrit.wikimedia.org/r/500615 (owner: Alex Monk)
[21:56:24] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.834 second response time https://phabricator.wikimedia.org/T174916
[22:00:26] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:01:18] <wikibugs>	 (PS1) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622
[22:04:03] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Yann) >>! In T219589#5072801, @Aklapper wrote: >>>! In T219589#5072592, @Yann wro...
[22:05:45] <wikibugs>	 (PS2) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622
[22:14:20] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:15:52] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:16:16] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:17:09] <wikibugs>	 (PS1) Andrew Bogott: Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627
[22:20:03] <wikibugs>	 (PS2) Andrew Bogott: Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627
[22:20:20] <wikibugs>	 (CR) Andrew Bogott: [V: +2 C: +2] Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627 (owner: Andrew Bogott)
[22:20:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:20:40] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:21:14] <wikibugs>	 Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr2-eqsin [edit protocols bgp group Transit4] -    import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; +    import [ BGP_sanitize_in BGP_tran...
[22:21:20] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:21:38] <wikibugs>	 (PS5) CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032
[22:21:48] <icinga-wm>	 PROBLEM - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[22:23:16] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:23:56] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:24:18] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:25:26] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:26:12] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 18 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[22:29:48] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:32:49] <wikibugs>	 Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (bd808) p:Unbreak!→High >>! In T217280#5074689, @greg wrote: > Just drive-by checking in on UBN!s: is this task s...
[22:37:54] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.312 second response time https://phabricator.wikimedia.org/T174916
[22:41:46] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:43:00] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.560 second response time https://phabricator.wikimedia.org/T174916
[22:45:41] <nuria>	 tons of varnishkafka errors when sending data to eventlogging
[22:45:43] <nuria>	 https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[22:46:53] <icinga-wm>	 PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:50:58] <wikibugs>	 (PS3) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622
[22:52:45] <XioNoX>	 !log restart pdfrender on scb1003 - T174916
[22:52:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:49] <stashbot>	 T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[22:53:10] <wikibugs>	 (PS1) Alex Monk: Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927)
[22:53:31] <wikibugs>	 Operations, Continuous-Integration-Config, Patch-For-Review, User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (Krinkle) >>! In T215562#5074013, @MoritzMuehlenhoff wrote: > @Krinkle I've prepar...
[22:53:53] <icinga-wm>	 RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time https://phabricator.wikimedia.org/T174916
[22:54:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[22:54:19] <XioNoX>	 yay
[22:56:15] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul)
[22:57:33] <wikibugs>	 (PS2) Alex Monk: Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927)
[23:00:05] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2300).
[23:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:01:17] <icinga-wm>	 PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:01:27] <wikibugs>	 Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) Open→Resolved Pushed progressively and confirmed with the looking glasses that only the proper communities were received on the other side. As well as the proper local_pref was applied...
[23:06:35] <wikibugs>	 (CR) Cwhite: profile: do not mutate level for mjolnir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: Cwhite)
[23:07:27] <nuria>	 Is there anyone here who can restart eventlogging kafka consumers?
[23:07:33] <wikibugs>	 (CR) Cwhite: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: Herron)
[23:08:10] <wikibugs>	 (CR) Cwhite: [C: +1] logstash: send varnish syslogs via kafka logging pipeline [puppet] - https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: Herron)
[23:10:05] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:16:22] <XioNoX>	 !log jnt push to csw2-esams
[23:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:46] <nuria>	 bblack: do you think you could help us reboot all eventlogging kafka producers? cc mobrovac
[23:18:07] <mobrovac>	 herron: too ^
[23:18:27] <nuria>	 bblack, herron: kafka jumbo cluster is kaput: https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[23:19:30] <nuria>	 mobrovac: we need to bounce the cluster entirely, right? seems it is totally off
[23:19:35] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:19:35] <icinga-wm>	 PROBLEM - swift-container-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:19:37] <XioNoX>	 I'm around if you need someone from SRE, but I don't know anything about kafka
[23:19:37] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:19:37] <icinga-wm>	 PROBLEM - dhclient process on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[23:19:41] <mobrovac>	 think so nuria, yeah
[23:19:45] <icinga-wm>	 PROBLEM - configured eth on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[23:19:47] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:19:55] <icinga-wm>	 PROBLEM - Disk space on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[23:19:59] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[23:20:01] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:20:03] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
[23:20:05] <icinga-wm>	 PROBLEM - puppet last run on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer
[23:20:31] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2017 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift
[23:20:31] <icinga-wm>	 RECOVERY - swift-container-server on ms-be2017 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift
[23:20:31] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift
[23:20:31] <icinga-wm>	 RECOVERY - dhclient process on ms-be2017 is OK: PROCS OK: 0 processes with command name dhclient
[23:20:41] <icinga-wm>	 RECOVERY - configured eth on ms-be2017 is OK: OK - interfaces up
[23:20:41] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2017 is OK: OK - load average: 27.17, 28.45, 27.39 https://wikitech.wikimedia.org/wiki/Swift
[23:20:49] <icinga-wm>	 RECOVERY - Disk space on ms-be2017 is OK: DISK OK
[23:20:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2017 is OK: OK ferm input default policy is set
[23:20:57] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift
[23:20:59] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2017 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[23:22:47] <wikibugs>	 (CR) Andrew Bogott: "Compiler diffs:" [puppet] - https://gerrit.wikimedia.org/r/500622 (owner: Andrew Bogott)
[23:22:47] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[23:23:54] <nuria>	 ping bblack again or herron
[23:24:59] <icinga-wm>	 RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures
[23:27:11] <icinga-wm>	 RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:27:29] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) ` papaul@asw-b-codfw> show interfaces ge-8/0/11 descriptions  Interface       Admin Link Description ge-8/0/11...
[23:27:49] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul)
[23:28:17] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul)
[23:28:29] <shdubsh>	 !log restart kafka on kafka-jumbo1001
[23:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:36:21] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:36:53] <shdubsh>	 !log restart kafka on kafka-jumbo1002
[23:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:25] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[23:41:25] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:42:21] <wikibugs>	 (PS1) Papaul: DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - https://gerrit.wikimedia.org/r/500634
[23:43:59] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[23:43:59] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1002 is OK: OK - running: The system is fully operational
[23:45:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:46:23] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 56 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[23:46:35] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational
[23:46:35] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul)
[23:47:21] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 122 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[23:47:27] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 149.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[23:47:37] <shdubsh>	 !log restarting kafka on kafka-jumbo1003
[23:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:07] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:48:07] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[23:49:11] <wikibugs>	 Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul)
[23:51:49] <wikibugs>	 (PS1) Bstorm: cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527)
[23:51:59] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 80 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[23:53:15] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[23:53:33] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:53:39] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:54:15] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[23:54:31] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:54:49] <shdubsh>	 !log restarting kafka on kafka-jumbo1004
[23:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:58:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:59:15] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:59:41] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties