[00:00:04] <jouncebot>	 Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190610T0000)
[05:15:59] <icinga-wm>	 RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[05:28:15] <icinga-wm>	 PROBLEM - Host db1077 is DOWN: PING CRITICAL - Packet loss = 100%
[05:32:55] <marostegui>	 Lovely
[05:32:58] <marostegui>	 Checking that
[05:33:12] <marostegui>	 it is down indeed
[05:33:45] <marostegui>	 and labsdb1009 also with issues :)
[05:33:46] <marostegui>	 nice
[05:34:13] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:34:49] <marostegui>	 expected as db1077 is the master for s3 on labs
[05:35:00] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516083
[05:35:31] <wikibugs>	 (03CR) 10Marostegui: [V: 03+2 C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516083 (owner: 10Marostegui)
[05:36:29] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516083 (owner: 10Marostegui)
[05:37:02] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 - host crashed (duration: 00m 52s)
[05:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:11] <marostegui>	 db1077 looks like BBU related
[05:38:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui)
[05:38:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:39:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) This is s3's sanitarium master, so for now s3 on labs will be lagging until we fix this host
[05:39:36] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (110 Connection timed out) Marostegui T225391 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:39:36] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.36 seconds Marostegui T225391 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:41:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) p:05Triage→03High @Cmjohnson looks like we have to first upgrade all the firmware: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0134828
[05:43:06] <wikibugs>	 (03PS1) 10Marostegui: db1077: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/516084 (https://phabricator.wikimedia.org/T225391)
[05:44:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) @Cmjohnson I will leave MySQL down so you can upgrade this host's firmwares as soon as you can without waiting for us to stop MySQL
[05:44:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1077: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/516084 (https://phabricator.wikimedia.org/T225391) (owner: 10Marostegui)
[06:05:59] <James_F>	 jouncebot: next
[06:05:59] <jouncebot>	 In 17 hour(s) and 54 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[06:06:12] <James_F>	 Good.
[06:08:04] <apergos>	 indeed
[06:12:25] <icinga-wm>	 PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2086 MB (4% inode=61%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:28:57] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10ArielGlenn) @WMDE-leszek I think Rachel's question was directed to you.
[06:29:57] <icinga-wm>	 PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:27] <icinga-wm>	 PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/07-wikimania.conf]
[06:30:41] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/get-raid-status-megacli]
[06:33:35] <icinga-wm>	 PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:42:07] <wikibugs>	 (03PS1) 10ArielGlenn: add awight as deployer [puppet] - 10https://gerrit.wikimedia.org/r/516109 (https://phabricator.wikimedia.org/T225062)
[06:44:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (10ArielGlenn) p:05Triage→03Normal
[06:44:33] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10ArielGlenn) p:05Triage→03Normal
[06:44:56] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10ArielGlenn) p:05Triage→03Normal
[06:45:20] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10ArielGlenn) p:05Triage→03Normal
[06:48:27] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (10ArielGlenn) p:05Triage→03Normal
[06:55:17] <icinga-wm>	 RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:27] <wikibugs>	 10Operations, 10Cloud-Services, 10Kubernetes, 10Patch-For-Review: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10ArielGlenn) p:05Triage→03Normal
[06:57:07] <icinga-wm>	 RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:57:35] <icinga-wm>	 RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:51] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:05:43] <wikibugs>	 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10ArielGlenn)
[07:05:45] <wikibugs>	 10Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn The cronjob producing these has been removed manually on June 3 and will not reappear since the role was removed earlier from the host. I don't see any...
[07:15:06] <wikibugs>	 10Operations: Debian mirror in sync with upstream - https://phabricator.wikimedia.org/T224706 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn One datapoint is that we are still getting updates. I checked and saw there are new entries to the repo from today, mirrored by us.  An email sent by Daniel on May 3...
[07:26:41] <wikibugs>	 10Operations: conftool is failing flake8 - https://phabricator.wikimedia.org/T212397 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn Fixed in https://gerrit.wikimedia.org/r/#/c/operations/software/conftool/+/503061/
[07:31:11] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:33:39] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:38:14] <wikibugs>	 10Operations, 10Mail, 10Phabricator, 10Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (10ArielGlenn) p:05Triage→03Normal I see that @MoritzMuehlenhoff  installed bsd-mailx manually on the box to fix future runs; this sh...
[07:42:43] <wikibugs>	 (03PS1) 10ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131
[07:43:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (owner: 10ArielGlenn)
[07:44:59] <wikibugs>	 (03PS2) 10ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804)
[07:52:23] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:52:47] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:17:00] <wikibugs>	 (03CR) 10Marostegui: "sounds good, let's check" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/515063 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[08:42:20] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: 10Ottomata)
[08:48:52] <zeljkof>	 jouncebot: next
[08:48:52] <jouncebot>	 In 15 hour(s) and 11 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[08:48:57] <zeljkof>	 :D
[09:03:43] <apergos>	 nice and easy!
[09:25:51] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:27:17] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[10:33:15] <wikibugs>	 (03PS1) 10MarcoAurelio: [WIP] New namespace aliases for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516195
[10:36:18] <wikibugs>	 (03PS2) 10MarcoAurelio: Set two new namespace aliases for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143)
[10:39:29] <hauskatze>	 jouncebot: next
[10:39:30] <jouncebot>	 In 13 hour(s) and 20 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[10:40:40] <wikibugs>	 (03CR) 10MarcoAurelio: "Note for SWAT deployer: this requires namespaceDupes.php afterwards." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143) (owner: 10MarcoAurelio)
[10:45:22] <wikibugs>	 (03CR) 10DannyS712: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143) (owner: 10MarcoAurelio)
[11:16:23] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:17:43] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:43:04] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (10tstarling) Looking at the template in question, the obvious solution is to stop doing that. If that's what it takes to exceed the memor...
[13:04:00] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging]
[13:04:01] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster staging completed
[13:04:01] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:44] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-eqiad-values.yaml stable/citoid [namespace: citoid, clusters: eqiad]
[13:13:46] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster eqiad completed
[13:13:46] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:05] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-codfw-values.yaml stable/citoid [namespace: citoid, clusters: codfw]
[13:18:07] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid cluster codfw completed
[13:18:07] <logmsgbot>	 !log mvolz@deploy1001 scap-helm citoid finished
[13:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:17] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[13:34:57] <wikibugs>	 (03CR) 10Ottomata: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[13:45:51] <wikibugs>	 (03PS2) 10Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:46:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:46:32] <wikibugs>	 (03CR) 10Ottomata: "Wow cool, did not know this was a thing!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:47:20] <wikibugs>	 (03PS3) 10Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:47:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:48:34] <wikibugs>	 (03PS4) 10Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:49:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable hcatalog integration for oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: 10EBernhardson)
[13:58:53] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:58:54] <wikibugs>	 (03PS1) 10Ottomata: Enable HCatalog support in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[13:59:32] <wikibugs>	 (03PS2) 10Ottomata: Enable HCatalog support in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[14:14:44] <wikibugs>	 (03PS1) 10Ottomata: Fix hcatalog conditional in oozie-site.xml.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/516265 (https://phabricator.wikimedia.org/T225310)
[14:15:11] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix hcatalog conditional in oozie-site.xml.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/516265 (https://phabricator.wikimedia.org/T225310) (owner: 10Ottomata)
[14:15:55] <wikibugs>	 (03PS3) 10Ottomata: Enable HCatalog support in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[14:18:50] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/16940/" [puppet] - 10https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310) (owner: 10Ottomata)
[14:18:52] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable HCatalog support in hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310) (owner: 10Ottomata)
[15:02:54] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (10Reedy) 05Open→03Declined >>! In T223737#5246815, @tstarling wrote: > Looking at the template in question, the obvious solution is t...
[15:14:39] <wikibugs>	 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) https://commons.wikimedia.org/wiki/File:President_Lula_and_Marisa.jpg first two revisions mi...
[15:17:51] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "Seems to match the discussion on T101631" [puppet] - 10https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) (owner: 10Jhedden)
[15:24:16] <wikibugs>	 (03CR) 1020after4: [C: 03+1] phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: 10ArielGlenn)
[15:24:56] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] phabricator logmail requires /usr/bin/mail be installed [puppet] - 10https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: 10ArielGlenn)
[15:39:19] <wikibugs>	 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10ArielGlenn) @mmodell you get farther than I do. I've checked the db and see the right key i...
[15:43:16] <wikibugs>	 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) @arielGlenn: The only thing left to do that I can think of is to run the git sshd...
[15:45:59] <wikibugs>	 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10ArielGlenn) Awesome, I'll be around if it's not ridiculous o'clock for me. There's a presen...
[15:53:12] <wikibugs>	 (03PS1) 10Ottomata: Enable HCatalog support in analytics hadoop oozie [puppet] - 10https://gerrit.wikimedia.org/r/516293 (https://phabricator.wikimedia.org/T225310)
[15:56:20] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Enable HCatalog support in analytics hadoop oozie [puppet] - 10https://gerrit.wikimedia.org/r/516293 (https://phabricator.wikimedia.org/T225310) (owner: 10Ottomata)
[16:16:10] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10RStallman-legalteam) @WMDE-leszek I went ahead and sent the NDAs to the four users mentioned above and will update the ticket once they are signed.
[16:24:56] <marostegui>	 !log Power reset db1077 from the idrac T225391
[16:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:02] <stashbot>	 T225391: db1077 crashed - https://phabricator.wikimedia.org/T225391
[16:30:59] <wikibugs>	 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10ArielGlenn) Just going to leave this here. https://bugs.debian.org/cgi-bin/bugreport.cgi?bu...
[17:03:36] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10Jdforrester-WMF) Does this have a PHP7 equivalent, given that we're moving off HHVM "soon"?
[17:25:37] <wikibugs>	 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) > openssh-server: SSH AuthorizedKeysCommand hangs when output is too large  Ah ha!...
[17:25:55] <icinga-wm>	 RECOVERY - Host db1077 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:39:58] <wikibugs>	 (03PS1) 10Ottomata: Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267)
[17:47:22] <wikibugs>	 (03PS1) 10Joal: Update AQS druid datasource to 2019_05 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/516307
[17:47:28] <joal>	 ottomata: --^ please :)
[17:47:52] <ottomata>	 k
[17:48:15] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Update AQS druid datasource to 2019_05 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/516307 (owner: 10Joal)
[17:55:00] <logmsgbot>	 !log otto@deploy1001 Started restart [analytics/aqs/deploy@fc1d232]: (no justification provided)
[17:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:15] <ottomata>	 oops, there is justification, logged in other chan
[17:55:28] <ottomata>	 !log rolling restart of AQS service using scap deploy for new mediawiki_history_snapshot
[17:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:32] <wikibugs>	 (03PS4) 10Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203)
[18:10:07] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (10greg) If you really want :) Approved.
[18:13:06] <wikibugs>	 (03PS5) 10Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203)
[18:14:58] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16944/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: 10Ottomata)
[18:28:40] <wikibugs>	 (03CR) 10Ori.livneh: "One last ping before giving up." [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh)
[18:38:21] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10Krinkle) >>! In T208191#5247496, @Jdforrester-WMF wrote: > Does this have a PHP7 equivalent, given that we're moving off HHVM "soon"?  Per:  >>! In T...
[18:40:24] <wikibugs>	 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10Jdforrester-WMF) Ah, so if we wait long enough, this will fix itself? ;-(
[18:48:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Cmjohnson) a:05Cmjohnson→03Marostegui I updated with the service pack and powered on...reassigning to @Marostegui
[18:52:41] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:53:37] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10Cmjohnson) 05Stalled→03Declined declining this for now since it's out of warranty and the disk has not failed
[18:53:40] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Cmjohnson)
[18:54:47] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T224795 (10Cmjohnson) 05Open→03Declined since this server is out of warranty and @elukey said to skip replacing the disk.  If the status changes and needs to be done please re-open task
[18:55:27] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:59:17] <wikibugs>	 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson This has been completed
[19:01:25] <icinga-wm>	 RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[19:05:03] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:06:25] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:07:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) Thanks @Cmjohnson - I can see that on the logs: `  /system1/log1/record15   Targets   Properties     number=15     severity=Informational     date=06/10/2019     time=16:34     description=Firmware fla...
[19:08:26] <wikibugs>	 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Cmjohnson) a:05Cmjohnson→03RobH @robh this disk will need to be ordered outside of the warranty. These servers were shipped without disks, the procurement ta...
[19:08:49] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) @Cmjohnson can you also check one of the power supply cables? It might be loose: ` /system1/log1/record17   Targets   Properties     number=17     severity=Caution     date=06/10/2019     time=17:16...
[19:09:12] <wikibugs>	 (03Abandoned) 10Ottomata: [WIP] Prometheus server for cloud-analytics project [puppet] - 10https://gerrit.wikimedia.org/r/479030 (https://phabricator.wikimedia.org/T211640) (owner: 10Ottomata)
[19:09:27] <wikibugs>	 (03Abandoned) 10Ottomata: Add LVS for druid-public-overlord indexing service [puppet] - 10https://gerrit.wikimedia.org/r/386427 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata)
[19:10:34] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml production stable/zotero [namespace: zotero, clusters: codfw]
[19:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:40] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[19:10:40] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[19:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:52] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-eqiad.yaml production stable/zotero [namespace: zotero, clusters: eqiad]
[19:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:57] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed
[19:10:57] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[19:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:16] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-staging.yaml staging stable/zotero [namespace: zotero, clusters: staging]
[19:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:24] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero cluster staging completed
[19:11:24] <logmsgbot>	 !log akosiaris@deploy1001 scap-helm zotero finished
[19:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:45] <akosiaris>	 !log refresh all zotero pods in all clusters
[19:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:45] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s3 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:15:07] <wikibugs>	 (03CR) 10Anomie: [C: 03+1] "Seems sane to me. One additional suggestion, if it's ok with the people who decide such things." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) (owner: 10Jhedden)
[19:20:16] <wikibugs>	 (03CR) 10Catrope: [C: 03+1] GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 (owner: 10Kosta Harlan)
[19:33:41] <wikibugs>	 (03PS1) 10Ottomata: Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203)
[19:34:36] <wikibugs>	 (03PS2) 10Ottomata: Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203)
[19:39:38] <thcipriani>	 !log restarting jenkins
[19:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:13] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (10awight) >>! In T223262#5247693, @greg wrote: > If you really want :) Approved.  #masochism not found
[19:42:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) MySQL started correctly, I have upgraded it and started replication as everything looked fine. Once it is up to date, I will run some data checks.
[19:45:45] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] Disable ApiAction log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: 10Ottomata)
[20:06:18] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203) (owner: 10Ottomata)
[20:34:06] <wikibugs>	 (03CR) 10Ottomata: "Hm, ok, moving to /user/analytics is going to be outside scope here, since current jobs are configured to read from /user/hdfs." [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[20:37:58] <wikibugs>	 (03PS4) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[20:39:14] <wikibugs>	 (03PS5) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[20:40:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:49:08] <wikibugs>	 (03CR) 10Ottomata: "Oh, I see you already did that.  Hm." [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[21:01:26] <wikibugs>	 (03PS6) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:04:10] <wikibugs>	 (03PS7) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:06:53] <wikibugs>	 (03PS8) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:09:35] <wikibugs>	 (03PS9) 10Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:10:42] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16949/an-master1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata)
[21:14:42] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) Ok!  Creds deployed, and oozie job merged.  Refinery will be deployed this week and we can try it out!
[22:14:37] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (10Southparkfan) The tasks regarding loss of PSU redundancy on cp303[2689] are normal priority, does this one need to be high priority?
[22:26:34] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] gerrit: only ship gerrit.json to logstash, not *_log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) (owner: 10Dzahn)
[22:37:58] <Reedy>	 jouncebot: now
[22:37:58] <jouncebot>	 For the next 1 hour(s) and 22 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190610T0000)
[22:38:00] <Reedy>	 jouncebot: next
[22:38:01] <jouncebot>	 In 1 hour(s) and 21 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[22:50:40] <wikibugs>	 (03PS1) 10Reedy: Prevent $wgFlaggedRevsNamespaces from having NS listed twice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516443 (https://phabricator.wikimedia.org/T225276)
[22:52:50] <James_F>	 Reedy: That means no deploys. :-)
[22:53:19] <James_F>	 Aka it's not broken enough.
[22:53:20] <Reedy>	 James_F: Based on the SAL, unless you're an SRE? ;P
[22:53:27] <James_F>	 Reedy: Yeah, well, quite.
[22:53:45] <James_F>	 "No touching appserver code", happy?
[22:54:01] <Reedy>	 Is config code?
[22:54:06] <James_F>	 It's the appservers we're worried about.
[22:54:08] <James_F>	 Yes. :-(
[23:52:37] <wikibugs>	 (03PS1) 10Smalyshev: Set up dumps for mediainfo RDF generation [puppet] - 10https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917)
[23:53:21] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Set up dumps for mediainfo RDF generation [puppet] - 10https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917) (owner: 10Smalyshev)