[00:00:04] Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190610T0000)
[05:15:59] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[05:28:15] PROBLEM - Host db1077 is DOWN: PING CRITICAL - Packet loss = 100%
[05:32:55] Lovely
[05:32:58] Checking that
[05:33:12] it is down indeed
[05:33:45] and labsdb1009 also with issues :)
[05:33:46] nice
[05:34:13] PROBLEM - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:34:49] expected as db1077 is the master for s3 on labs
[05:35:00] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/516083
[05:35:31] (CR) Marostegui: [V: +2 C: +2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/516083 (owner: Marostegui)
[05:36:29] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/516083 (owner: Marostegui)
[05:37:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 - host crashed (duration: 00m 52s)
[05:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:11] db1077 looks like BBU related
[05:38:33] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui)
[05:38:39] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:39:15] Operations, ops-eqiad, DBA: db1077 crashed -
https://phabricator.wikimedia.org/T225391 (Marostegui) This is s3's sanitarium master, so for now s3 on labs will be lagging until we fix this host
[05:39:36] ACKNOWLEDGEMENT - MariaDB Slave IO: s3 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1077.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1077.eqiad.wmnet (110 Connection timed out) Marostegui T225391 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:39:36] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 747.36 seconds Marostegui T225391 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:41:24] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) p:Triage→High @Cmjohnson looks like we have to first upgrade all the firmware: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0134828
[05:43:06] (PS1) Marostegui: db1077: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/516084 (https://phabricator.wikimedia.org/T225391)
[05:44:32] Operations, ops-eqiad, DBA, Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) @Cmjohnson I will leave MySQL down so you can upgrade this host's firmwares as soon as you can without waiting for us to stop MySQL
[05:44:40] (CR) Marostegui: [C: +2] db1077: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/516084 (https://phabricator.wikimedia.org/T225391) (owner: Marostegui)
[06:05:59] jouncebot: next
[06:05:59] In 17 hour(s) and 54 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[06:06:12] Good.
[06:08:04] indeed
[06:12:25] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2086 MB (4% inode=61%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:28:57] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (ArielGlenn) @WMDE-leszek I think Rachel's question was directed to you.
[06:29:57] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:27] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/07-wikimania.conf]
[06:30:41] PROBLEM - puppet last run on cloudvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/get-raid-status-megacli]
[06:33:35] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:42:07] (PS1) ArielGlenn: add awight as deployer [puppet] - https://gerrit.wikimedia.org/r/516109 (https://phabricator.wikimedia.org/T225062)
[06:44:19] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (ArielGlenn) p:Triage→Normal
[06:44:33] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (ArielGlenn) p:Triage→Normal
[06:44:56] Operations, SRE-Access-Requests, observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (ArielGlenn) p:Triage→Normal
[06:45:20] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (ArielGlenn) p:Triage→Normal
[06:48:27] Operations, Wikimedia-Site-requests, serviceops, Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (ArielGlenn) p:Triage→Normal
[06:55:17] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:27] Operations, Cloud-Services, Kubernetes, Patch-For-Review: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (ArielGlenn) p:Triage→Normal
[06:57:07] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:51] RECOVERY - puppet last run on cloudvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:05:43] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@
- https://phabricator.wikimedia.org/T132324 (ArielGlenn)
[07:05:45] Operations: Cron spam from phab1001 delete of temporary files - https://phabricator.wikimedia.org/T224727 (ArielGlenn) Open→Resolved a:ArielGlenn The cronjob producing these has been removed manually on June 3 and will not reappear since the role was removed earlier from the host. I don't see any...
[07:15:06] Operations: Debian mirror in sync with upstream - https://phabricator.wikimedia.org/T224706 (ArielGlenn) Open→Resolved a:ArielGlenn One datapoint is that we are still getting updates. I checked and saw there are new entries to the repo from today, mirrored by us. An email sent by Daniel on May 3...
[07:26:41] Operations: conftool is failing flake8 - https://phabricator.wikimedia.org/T212397 (ArielGlenn) Open→Resolved a:ArielGlenn Fixed in https://gerrit.wikimedia.org/r/#/c/operations/software/conftool/+/503061/
[07:31:11] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:33:39] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:38:14] Operations, Mail, Phabricator, Regression: "Phabricator monthly statistics" email on wikitech-l@ missing for May 2019 - https://phabricator.wikimedia.org/T224804 (ArielGlenn) p:Triage→Normal I see that @MoritzMuehlenhoff installed bsd-mailx manually on the box to fix future runs; this sh...
[07:42:43] (PS1) ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131
[07:43:38] (CR) jerkins-bot: [V: -1] phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (owner: ArielGlenn)
[07:44:59] (PS2) ArielGlenn: phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804)
[07:52:23] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[07:52:47] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:17:00] (CR) Marostegui: "sounds good, let's check" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/515063 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[08:42:20] (CR) Ppchelko: [C: +1] Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: Ottomata)
[08:48:52] jouncebot: next
[08:48:52] In 15 hour(s) and 11 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[08:48:57] :D
[09:03:43] nice and easy!
[09:25:51] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:27:17] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
https://wikitech.wikimedia.org/wiki/Mailman
[10:33:15] (PS1) MarcoAurelio: [WIP] New namespace aliases for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/516195
[10:36:18] (PS2) MarcoAurelio: Set two new namespace aliases for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143)
[10:39:29] jouncebot: next
[10:39:30] In 13 hour(s) and 20 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[10:40:40] (CR) MarcoAurelio: "Note for SWAT deployer: this requires namespaceDupes.php afterwards." [mediawiki-config] - https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143) (owner: MarcoAurelio)
[10:45:22] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/516195 (https://phabricator.wikimedia.org/T216143) (owner: MarcoAurelio)
[11:16:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:17:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:43:04] Operations, Wikimedia-Site-requests, serviceops, Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (tstarling) Looking at the template in question, the obvious solution is to stop doing that. If that's what it takes to exceed the memor...
[13:04:00] !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging]
[13:04:01] !log mvolz@deploy1001 scap-helm citoid cluster staging completed
[13:04:01] !log mvolz@deploy1001 scap-helm citoid finished
[13:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:44] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-eqiad-values.yaml stable/citoid [namespace: citoid, clusters: eqiad]
[13:13:46] !log mvolz@deploy1001 scap-helm citoid cluster eqiad completed
[13:13:46] !log mvolz@deploy1001 scap-helm citoid finished
[13:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:05] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-codfw-values.yaml stable/citoid [namespace: citoid, clusters: codfw]
[13:18:07] !log mvolz@deploy1001 scap-helm citoid cluster codfw completed
[13:18:07] !log mvolz@deploy1001 scap-helm citoid finished
[13:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:17] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[13:34:57] (CR) Ottomata: Allow Hadoop-related profiles to deploy Kerberos keytabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[13:45:51] (PS2) Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:46:10] (CR) jerkins-bot: [V: -1] Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:46:32] (CR) Ottomata: "Wow cool, did not know this was a thing!" [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:47:20] (PS3) Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:47:44] (CR) jerkins-bot: [V: -1] Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:48:34] (PS4) Ottomata: Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:49:41] (CR) Ottomata: [C: +2] Enable hcatalog integration for oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/515112 (https://phabricator.wikimedia.org/T225310) (owner: EBernhardson)
[13:58:53] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[13:58:54] (PS1) Ottomata: Enable HCatalog support in hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[13:59:32] (PS2) Ottomata: Enable HCatalog support in hadoop test cluster [puppet] -
https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[14:14:44] (PS1) Ottomata: Fix hcatalog conditional in oozie-site.xml.erb [puppet/cdh] - https://gerrit.wikimedia.org/r/516265 (https://phabricator.wikimedia.org/T225310)
[14:15:11] (CR) Ottomata: [V: +2 C: +2] Fix hcatalog conditional in oozie-site.xml.erb [puppet/cdh] - https://gerrit.wikimedia.org/r/516265 (https://phabricator.wikimedia.org/T225310) (owner: Ottomata)
[14:15:55] (PS3) Ottomata: Enable HCatalog support in hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310)
[14:18:50] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/16940/" [puppet] - https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310) (owner: Ottomata)
[14:18:52] (CR) Ottomata: [C: +2] Enable HCatalog support in hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/516261 (https://phabricator.wikimedia.org/T225310) (owner: Ottomata)
[15:02:54] Operations, Wikimedia-Site-requests, serviceops, Patch-For-Review, and 2 others: Increase Memory Limit for Scribunto - https://phabricator.wikimedia.org/T223737 (Reedy) Open→Declined >>! In T223737#5246815, @tstarling wrote: > Looking at the template in question, the obvious solution is t...
[15:14:39] Operations, Commons, Multimedia, media-storage, User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (AlexisJazz) https://commons.wikimedia.org/wiki/File:President_Lula_and_Marisa.jpg first two revisions mi...
[15:17:51] (CR) BryanDavis: [C: +1] "Seems to match the discussion on T101631" [puppet] - https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) (owner: Jhedden)
[15:24:16] (CR) 20after4: [C: +1] phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: ArielGlenn)
[15:24:56] (CR) Paladox: [C: +1] phabricator logmail requires /usr/bin/mail be installed [puppet] - https://gerrit.wikimedia.org/r/516131 (https://phabricator.wikimedia.org/T224804) (owner: ArielGlenn)
[15:39:19] Operations, Diffusion, Release-Engineering-Team (Kanban), User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (ArielGlenn) @mmodell you get farther than I do. I've checked the db and see the right key i...
[15:43:16] Operations, Diffusion, Release-Engineering-Team (Kanban), User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) @arielGlenn: The only thing left to do that I can think of is to run the git sshd...
[15:45:59] Operations, Diffusion, Release-Engineering-Team (Kanban), User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (ArielGlenn) Awesome, I'll be around if it's not ridiculous o'clock for me. There's a presen...
[15:53:12] (PS1) Ottomata: Enable HCatalog support in analytics hadoop oozie [puppet] - https://gerrit.wikimedia.org/r/516293 (https://phabricator.wikimedia.org/T225310)
[15:56:20] (CR) Ottomata: [C: +2] Enable HCatalog support in analytics hadoop oozie [puppet] - https://gerrit.wikimedia.org/r/516293 (https://phabricator.wikimedia.org/T225310) (owner: Ottomata)
[16:16:10] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (RStallman-legalteam) @WMDE-leszek I went ahead and sent the NDAs to the four users mentioned above and will update the ticket once they are signed.
[16:24:56] !log Power reset db1077 from the idrac T225391
[16:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:02] T225391: db1077 crashed - https://phabricator.wikimedia.org/T225391
[16:30:59] Operations, Diffusion, Release-Engineering-Team (Kanban), User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (ArielGlenn) Just going to leave this here. https://bugs.debian.org/cgi-bin/bugreport.cgi?bu...
[17:03:36] Operations, Wikimedia-Site-requests, HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (Jdforrester-WMF) Does this have a PHP7 equivalent, given that we're moving off HHVM "soon"?
[17:25:37] Operations, Diffusion, Release-Engineering-Team (Kanban), User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) > openssh-server: SSH AuthorizedKeysCommand hangs when output is too large Ah ha!...
[17:25:55] RECOVERY - Host db1077 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:39:58] (PS1) Ottomata: Disable ApiAction log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267)
[17:47:22] (PS1) Joal: Update AQS druid datasource to 2019_05 snapshot [puppet] - https://gerrit.wikimedia.org/r/516307
[17:47:28] ottomata: --^ please :)
[17:47:52] k
[17:48:15] (CR) Ottomata: [C: +2] Update AQS druid datasource to 2019_05 snapshot [puppet] - https://gerrit.wikimedia.org/r/516307 (owner: Joal)
[17:55:00] !log otto@deploy1001 Started restart [analytics/aqs/deploy@fc1d232]: (no justification provided)
[17:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:15] oops, there is justification, logged in other chan
[17:55:28] !log rolling restart of AQS service using scap deploy for new mediawiki_history_snaphost
[17:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:32] (PS4) Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203)
[18:10:07] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (greg) If you really want :) Approved.
[18:13:06] (PS5) Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203)
[18:14:58] (CR) Ottomata: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/16944/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: Ottomata)
[18:28:40] (CR) Ori.livneh: "One last ping before giving up."
[puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh)
[18:38:21] Operations, Wikimedia-Site-requests, HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (Krinkle) >>! In T208191#5247496, @Jdforrester-WMF wrote: > Does this have a PHP7 equivalent, given that we're moving off HHVM "soon"? Per: >>! In T...
[18:40:24] Operations, Wikimedia-Site-requests, HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (Jdforrester-WMF) Ah, so if we wait long enough, this will fix itself? ;-(
[18:48:35] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Cmjohnson) a:Cmjohnson→Marostegui I updated with the service pack and powered on...reassigning to @Marostegui
[18:52:41] PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:53:37] Operations, ops-eqiad, DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (Cmjohnson) Stalled→Declined declining this for now since it's out of warranty and the disk has not failed
[18:53:40] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Cmjohnson)
[18:54:47] Operations, ops-eqiad, Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T224795 (Cmjohnson) Open→Declined since this server is out of warranty and @elukey said to skip replacing the disk.
If the status changes and needs to be done please re-open task
[18:55:27] RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:59:17] Operations, ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (Cmjohnson) Open→Resolved a:Cmjohnson This has been completed
[19:01:25] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[19:05:03] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:06:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:07:02] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) Thanks @Cmjohnson - I can see that on the logs: ` /system1/log1/record15 Targets Properties number=15 severity=Informational date=06/10/2019 time=16:34 description=Firmware fla...
[19:08:26] Operations, ops-eqiad, Cassandra, DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Cmjohnson) a:Cmjohnson→RobH @robh this disk will need to be ordered outside of the warranty. These servers were shipped without disks, the procurement ta...
[19:08:49] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) @Cmjohnson can you also check one of the power supply cables? It might be loose: ` /system1/log1/record17 Targets Properties number=17 severity=Caution date=06/10/2019 time=17:16...
[19:09:12] (Abandoned) Ottomata: [WIP] Prometheus server for cloud-analytics project [puppet] - https://gerrit.wikimedia.org/r/479030 (https://phabricator.wikimedia.org/T211640) (owner: Ottomata)
[19:09:27] (Abandoned) Ottomata: Add LVS for druid-public-overlord indexing service [puppet] - https://gerrit.wikimedia.org/r/386427 (https://phabricator.wikimedia.org/T176223) (owner: Ottomata)
[19:10:34] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml production stable/zotero [namespace: zotero, clusters: codfw]
[19:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:40] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[19:10:40] !log akosiaris@deploy1001 scap-helm zotero finished
[19:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:52] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-eqiad.yaml production stable/zotero [namespace: zotero, clusters: eqiad]
[19:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:57] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed
[19:10:57] !log akosiaris@deploy1001 scap-helm zotero finished
[19:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:16] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-staging.yaml staging stable/zotero [namespace: zotero, clusters: staging]
[19:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:24] !log akosiaris@deploy1001 scap-helm zotero cluster staging completed
[19:11:24] !log akosiaris@deploy1001 scap-helm zotero finished
[19:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:31] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:45] !log refresh all zotero pods in all clusters
[19:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:45] RECOVERY - MariaDB Slave IO: s3 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:15:07] (CR) Anomie: [C: +1] "Seems sane to me. One additional suggestion, if it's ok with the people who decide such things." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515062 (https://phabricator.wikimedia.org/T101631) (owner: Jhedden)
[19:20:16] (CR) Catrope: [C: +1] GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - https://gerrit.wikimedia.org/r/514638 (owner: Kosta Harlan)
[19:33:41] (PS1) Ottomata: Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203)
[19:34:36] (PS2) Ottomata: Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203)
[19:39:38] !log restarting jenkins
[19:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:13] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (awight) >>! In T223262#5247693, @greg wrote: > If you really want :) Approved. #masochism not found
[19:42:43] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) MySQL started correctly, I have upgraded it and started replication as everything looked fine. Once it is up to date, I will run some data checks.
[19:45:45] (CR) Ppchelko: [C: +1] Disable ApiAction log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/516303 (https://phabricator.wikimedia.org/T222267) (owner: Ottomata)
[20:06:18] (CR) Ottomata: [C: +2] Use method gt instead of ge for eventgate validation error throughput alerts [puppet] - https://gerrit.wikimedia.org/r/516324 (https://phabricator.wikimedia.org/T225203) (owner: Ottomata)
[20:34:06] (CR) Ottomata: "Hm, ok, moving to /user/analytics is going to be outside scope here, since current jobs are configured to read from /user/hdfs." [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[20:37:58] (PS4) Ottomata: \Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[20:39:14] (PS5) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[20:40:03] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[20:49:08] (CR) Ottomata: "Oh, I see you already did that. Hm."
[puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[21:01:26] (PS6) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:04:10] (PS7) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:06:53] (PS8) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:09:35] (PS9) Ottomata: Include Swift analytics_admin auth .env file in HDFS [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544)
[21:10:42] (CR) Ottomata: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/16949/an-master1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/512210 (https://phabricator.wikimedia.org/T219544) (owner: Ottomata)
[21:14:42] Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Ottomata) Ok! Creds deployed, and oozie job merged. Refinery will be deployed this week and we can try it out!
[22:14:37] Operations, ops-esams, Traffic: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (Southparkfan) The tasks regarding loss of PSU redundancy on cp303[2689] are normal priority, does this one need to be high priority?
[22:26:34] (CR) Thcipriani: [C: +1] gerrit: only ship gerrit.json to logstash, not *_log (1 comment) [puppet] - https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) (owner: Dzahn)
[22:37:58] jouncebot: now
[22:37:58] For the next 1 hour(s) and 22 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190610T0000)
[22:38:00] jouncebot: next
[22:38:01] In 1 hour(s) and 21 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[22:50:40] (PS1) Reedy: Prevent $wgFlaggedRevsNamespaces from having NS listed twice [mediawiki-config] - https://gerrit.wikimedia.org/r/516443 (https://phabricator.wikimedia.org/T225276)
[22:52:50] Reedy: That means no deploys. :-)
[22:53:19] Aka it's not broken enough.
[22:53:20] James_F: Based on the SAL, unless you're an SRE? ;P
[22:53:27] Reedy: Yeah, well, quite.
[22:53:45] "No touching appserver code", happy?
[22:54:01] Is config code?
[22:54:06] It's the appservers we're worried about.
[22:54:08] Yes. :-(
[23:52:37] (PS1) Smalyshev: Set up dumps for mediainfo RDF generation [puppet] - https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917)
[23:53:21] (CR) jerkins-bot: [V: -1] Set up dumps for mediainfo RDF generation [puppet] - https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917) (owner: Smalyshev)