[00:15:02] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Paladox)
[00:15:36] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Huji) Server logs?
[00:19:10] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (jrbs) >>! In T209802#4756855, @Huji wrote: > Server logs? I don't have them to hand but I think they're similar to those @bawolff found T209656#4752391
[00:22:04] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Jalexander) Not sure if this is enough but what I was seeing in logstash. I have a feeling there are other log issues that aren't appear in there (at least wit...
[00:31:55] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 28520 MB (5% inode=99%)
[00:34:57] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:43:03] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[00:52:05] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:00:25] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:11:05] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[01:54:07] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Huji) I had a feeling it is related to FastCGI as well (just like the other two mentioned above). Sadly, I have no knowledge of FastCGI troubleshooting. I will...
[02:12:16] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Bawolff) >>! In T209802#4756914, @Huji wrote: > I had a feeling it is related to FastCGI as well (just like the other two mentioned above). Sadly, I have no kn...
[02:14:45] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:23:17] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=83%)
[02:30:35] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:30:47] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:30:59] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:31:13] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:31:15] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:34:11] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[02:35:05] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (jrbs) I'm currently trying to set up a test election for folks to reproduce. I screwed it up the first time around on enwiki so just trying to sort that out.
[02:39:29] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[02:39:41] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[02:39:53] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[02:40:05] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[02:40:07] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[02:41:27] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[02:44:21] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (jrbs)
[02:44:23] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:45:26] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (jrbs)
[03:00:58] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Bawolff) As an aside, telling x-wikimedia-debug to send me to a php7 seemed to make it work, so definitely seems hhvm related.
[03:25:22] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Bawolff) So looking in the logs, it seems like a log event is generated for importing the key into gpg, but there is no log event for actually encrypting the v...
[03:27:23] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:29:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 851.19 seconds
[03:41:55] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[03:56:00] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Bawolff)
[03:57:09] revi: So this used to work as of like a couple weeks ago right?
[03:57:13] I wonder what changed
[03:57:34] actually...the problem is I never tried editing with election few weeks ago
[03:57:54] so I don't know if it was possible to edit SecurePoll with my account
[03:57:54] revi: But (for other bug) people could successfully vote right?
[03:57:57] yeah
[03:58:03] the latest vote was concluded last week
[03:58:20] and around 100 people voted, so it worked till last week
[03:58:28] then it just started yelling at us
[03:58:48] End of last Tuesday (UTC)
[03:59:11] Last vote was cast at `22:59, 13 November 2018`
[03:59:13] (UTC)
[03:59:22] or wait... it's probably KST
[04:00:05] yeah it's most likely Korean time if SecurePoll respects preferences timezone
[04:00:21] so in UTC it is 13:59, 13 Nov 2018
[04:01:19] while testing on php-1.33.0-wmf.1, it still hangs in eval.php
[04:03:04] oh. maybe mwscript doesn't work the way i think it does
[04:12:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.99 seconds
[04:14:58] I'll be AFK but can respond to pings so don't hesitate if you need something from me
[04:17:55] This is really outside of my expertise, probably needs someone from SRE to debug what's going on
[04:19:22] not sure if we should summon someone tho
[04:19:32] we have like... 19 hours now
[04:26:55] Well I assume europe would be waking up soon
[04:29:35] It definitely seems to be a gpg problem. If one doesn't care about secret ballots, that could just be disabled
[04:32:39] you can fetch the exact command from the logstash 'command' channel
[04:33:03] if that doesn't hang then it probably interferes with the stream_select magic in ShellCommand somehow
[04:42:18] tgr: The import key command is in the logs
[04:43:10] I tried stepping through with mwrepl, but when single stepping, stuff doesn't hang
[04:43:46] stream_select magic sounds like a good theory, but in terms of debugging or verifying that, I'm out of my depth
[04:44:54] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Bawolff)
[04:47:06] hmm, when i do ps on mwdebug1002, there are 3 limit.sh 'gpg' processes (but no actual gpg processes, although there are some gpg-agent processes)
[05:26:51] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:41:25] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[05:47:02] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Jalexander) >>! In T209802#4756957, @Bawolff wrote: > As an aside, telling x-wikimedia-debug to send me to a php7 seemed to make it work, so definitely seems h...
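[editor's note] The timezone arithmetic from 03:59 to 04:00 can be checked mechanically. This is only a sketch in Python, assuming (as the chat itself guesses) that the displayed `22:59, 13 November 2018` was rendered in KST, a fixed UTC+9 offset. The helper name `kst_to_utc` is made up for illustration and is not part of SecurePoll.

```python
from datetime import datetime, timedelta, timezone

# KST is a fixed UTC+9 offset with no daylight saving time.
KST = timezone(timedelta(hours=9), name="KST")


def kst_to_utc(local: datetime) -> datetime:
    """Interpret a naive timestamp as KST and return the same instant in UTC."""
    return local.replace(tzinfo=KST).astimezone(timezone.utc)


# Timestamp of the last successful vote as quoted in the chat.
last_vote_utc = kst_to_utc(datetime(2018, 11, 13, 22, 59))
print(last_vote_utc.strftime("%H:%M, %d %B %Y (UTC)"))
```

Under that assumption the result agrees with the 13:59 UTC figure given at 04:00:21.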
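[editor's note] The suggestion at 04:32 to 04:33 is to run the logged command on its own and see whether it hangs. A rough way to make that check repeatable is to run the command with a hard timeout. The Python sketch below is an assumption-laden illustration: `hangs` is a hypothetical helper, and the commands are harmless stand-ins, not the gpg invocation from logstash. It does not reproduce how MediaWiki's ShellCommand or limit.sh handle pipes.

```python
import subprocess
import sys


def hangs(argv, timeout_s):
    """Return True if the command is still running after timeout_s seconds."""
    try:
        # Close stdin and capture both output streams so the child cannot
        # block waiting for terminal input or on a full pipe.
        subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return True
    return False


# Stand-in commands: one that sleeps past the timeout, one that exits at once.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "pass"]
print(hangs(slow, 0.5), hangs(fast, 10))
```

If the real command finishes promptly when run this way but hangs inside the application, that would point toward the calling code rather than gpg itself, which is the reasoning offered in the chat.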
[06:09:13] (PS1) Marostegui: db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754)
[06:10:16] (CR) jerkins-bot: [V: -1] db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[06:12:21] (PS2) Marostegui: db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754)
[06:13:21] (CR) jerkins-bot: [V: -1] db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[06:15:42] (PS3) Marostegui: db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754)
[06:18:13] (CR) Marostegui: [C: 2] db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[06:19:20] (Merged) jenkins-bot: db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[06:20:05] !log marostegui@deploy1001 sync-file aborted: Add db1078 line back to config file but depooled T209754 (duration: 00m 02s)
[06:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:09] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754
[06:21:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add db1078 line back to config file but depooled T209754 (duration: 00m 51s)
[06:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:25] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) I have rebooted this host to see if there were any HW errors on boot-up, but it came back fine, no storage, memory or any other kind of error reported.
[06:31:06] (CR) jenkins-bot: db-eqiad.php: Add db1078 to the file, but depooled [mediawiki-config] - https://gerrit.wikimedia.org/r/474620 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[06:31:11] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[06:42:53] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) This host also crashed a bit over a year ago: T173365 Even if I didn't find any trace of a real storage crash, this is what syslog shows 10 minutes before the crash: ` Nov...
[06:51:21] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:07:10] (PS1) Marostegui: install_server: Allow re-image db1078 [puppet] - https://gerrit.wikimedia.org/r/474621 (https://phabricator.wikimedia.org/T209754)
[07:08:36] (CR) Marostegui: [C: 2] install_server: Allow re-image db1078 [puppet] - https://gerrit.wikimedia.org/r/474621 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[07:12:12] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:12:22] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:12:25] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:12:44] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:12:48] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:13:02] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:17:29] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[07:29:41] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1080.eqiad.wmnet'...
[07:31:28] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` and were **ALL** successful.
[07:41:49] (Abandoned) Zoranzoki21: Add new throttle rule for Art+Feminism Event on 2018-11-17 [mediawiki-config] - https://gerrit.wikimedia.org/r/473255 (https://phabricator.wikimedia.org/T209324) (owner: Zoranzoki21)
[07:45:12] (PS2) Muehlenhoff: Remove Diamond from Swift backends [puppet] - https://gerrit.wikimedia.org/r/474280 (https://phabricator.wikimedia.org/T183454)
[07:47:55] (CR) Muehlenhoff: [C: 2] Remove Diamond from Swift backends [puppet] - https://gerrit.wikimedia.org/r/474280 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[07:52:15] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474627 (https://phabricator.wikimedia.org/T209754)
[07:54:23] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474627 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[07:55:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474627 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[07:56:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 to clone db1078 T209754 (duration: 00m 47s)
[07:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:41] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754
[07:57:14] !log Stop MySQL on db1123 - T209754
[07:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:39] (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474627 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[08:01:21] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1080.eqiad.wmnet'] ` and were **ALL** successful.
[08:03:14] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:09:12] PROBLEM - Check systemd state on ms-be1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:10:24] PROBLEM - Check systemd state on ms-be1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:10:58] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:11:18] PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:11:32] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:12:46] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:14:18] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:15:42] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:17:37] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:26:02] Operations, media-storage: Ingest swift access logs for thumbnail/original analysis - https://phabricator.wikimedia.org/T209810 (fgiunchedi)
[08:27:22] (PS2) Effie Mouzeli: cumin: create alias for role redis::misc [puppet] - https://gerrit.wikimedia.org/r/473582
[08:28:05] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey)
[08:30:32] (PS1) Elukey: Set the DHCP settings for the an-worker nodes to their 10G NIC mac [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192)
[08:30:59] (CR) Muehlenhoff: [C: 1] cumin: create alias for role redis::misc [puppet] - https://gerrit.wikimedia.org/r/473582 (owner: Effie Mouzeli)
[08:31:08] (CR) jerkins-bot: [V: -1] Set the DHCP settings for the an-worker nodes to their 10G NIC mac [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[08:32:32] (PS2) Elukey: Set 10G mac addresses of an-worker nodes in DHCP config [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192)
[08:33:17] (CR) jerkins-bot: [V: -1] Set 10G mac addresses of an-worker nodes in DHCP config [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[08:35:15] (PS3) Elukey: Set 10G mac addresses of an-worker nodes in DHCP config [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192)
[08:35:55] (CR) Elukey: [C: 2] Set 10G mac addresses of an-worker nodes in DHCP config [puppet] - https://gerrit.wikimedia.org/r/474635 (https://phabricator.wikimedia.org/T207192) (owner: Elukey)
[08:36:33] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (MoritzMuehlenhoff)
[08:37:13] RECOVERY - Check systemd state on ms-be1021 is OK: OK - running: The system is fully operational
[08:37:13] RECOVERY - Check systemd state on ms-be1013 is OK: OK - running: The system is fully operational
[08:37:49] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:37:51] PROBLEM - Host ms-be2044 is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:58] ms-be2044 is me
[08:39:11] RECOVERY - Host ms-be2044 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[08:39:31] RECOVERY - Check systemd state on ms-be1020 is OK: OK - running: The system is fully operational
[08:40:07] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational
[08:40:43] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational
[08:40:53] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:07] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:12] Operations, ops-codfw, Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (fgiunchedi) >>! In T209395#4754151, @Papaul wrote: > @fgiunchedi I did the install on the first system ms-be2044 please check the output below. If it lo...
[08:41:38] Operations, ops-codfw: Degraded RAID on ms-be2046 - https://phabricator.wikimedia.org/T209727 (fgiunchedi) Open>Invalid Systems being setup in {T209395}
[08:42:41] RECOVERY - puppet last run on ms-be2039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:43:31] (PS2) CRusnov: Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037)
[08:43:43] !log executing schema change on db2095 (T85757)
[08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:47] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[08:45:52] (CR) Effie Mouzeli: [C: 2] cumin: create alias for role redis::misc [puppet] - https://gerrit.wikimedia.org/r/473582 (owner: Effie Mouzeli)
[08:46:07] (PS3) Effie Mouzeli: cumin: create alias for role redis::misc [puppet] - https://gerrit.wikimedia.org/r/473582
[08:46:09] (CR) jerkins-bot: [V: -1] Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[08:46:31] (CR) Effie Mouzeli: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/13570/oresrdb2001.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/474450 (https://phabricator.wikimedia.org/T209628) (owner: Alexandros Kosiaris)
[08:47:06] (CR) CRusnov: "I have completed the suggested changes." (14 comments) [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[08:47:38] (CR) CRusnov: Make the puppetdb backend process primitive types for queries. (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[08:50:13] PROBLEM - MariaDB Slave SQL: s6 on db2095 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Could not execute Update_rows_v1 event on table frwiki.user: Unknown column user_options in NEW, Error_code: 1054: handler error HA_ERR_GENERIC: the events master log db2076-bin.001214, end_log_pos 500560723
[08:52:28] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on db2095 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Could not execute Update_rows_v1 event on table frwiki.user: Unknown column user_options in NEW, Error_code: 1054: handler error HA_ERR_GENERIC: the events master log db2076-bin.001214, end_log_pos 500560723 Banyek T85757
[08:56:49] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (Joe) >>! In T209802#4756963, @Bawolff wrote: > So looking in the logs, it seems like a log event is generated for importing the key into gpg, but there is no l...
[08:58:01] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1081.eqiad.wmnet'...
[09:06:56] Operations, decommission, User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (jijiki)
[09:07:25] Operations, MediaWiki-extensions-SecurePoll, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (jrbs) Worth noting here - coordinators agreed to push voting back on the elections by 24 hours (i.e. 00:00 UTC on November 20).
[09:07:47] Operations, decommission, Patch-For-Review, User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (jijiki)
[09:10:14] !log Rebuilt message group stats cache for T208521
[09:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:17] T208521: Translate extension language bar not reliably showing how complete a translation is - https://phabricator.wikimedia.org/T208521
[09:12:19] (PS1) Marostegui: db-eqiad.php: Repool db1123 and db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/474638 (https://phabricator.wikimedia.org/T209754)
[09:12:48] (CR) Marostegui: [C: -1] "Wait until replication lag is gone" [mediawiki-config] - https://gerrit.wikimedia.org/r/474638 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:14:46] (PS1) Marostegui: db1078: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/474640 (https://phabricator.wikimedia.org/T209754)
[09:16:19] Operations, Analytics, EventBus, WMF-JobQueue, and 4 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220 (jijiki) @Pchelolo/@mobrovac jobqueue_redis instances have been removed from prod and we have cleaned up any puppet and mediawiki-config references . Should we...
[09:19:48] RECOVERY - MariaDB Slave SQL: s6 on db2095 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:19:56] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3365 mails in exim queue.
[09:26:29] (CR) Filippo Giunchedi: [C: -1] role: add aggregations for TCP Fast Open to prometheus global (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[09:30:19] (CR) Filippo Giunchedi: [C: 1] "Updating the submodule in puppet.git will require an additional commit/review in puppet.git itself." [puppet/nginx] - https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[09:32:10] (CR) Marostegui: [C: 2] db1078: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/474640 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:34:40] Operations, ops-eqiad, DBA: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (Marostegui)
[09:34:49] Operations, ops-eqiad, DBA: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (Marostegui) p:Triage>Normal
[09:41:41] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1123 and db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/474638 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:42:48] (Merged) jenkins-bot: db-eqiad.php: Repool db1123 and db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/474638 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:43:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1123 and db1078 T209754 (duration: 00m 46s)
[09:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:04] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754
[09:45:54] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:46:30] !log Drop empty testwiki.petition_data from db1075 with replication - T208979
[09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:33] T208979: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979
[09:48:54] !log Rename table foundationwiki.petition_data on db1078 - T208979
[09:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:08] (CR) jenkins-bot: db-eqiad.php: Repool db1123 and db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/474638 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:49:34] Operations, DBA, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) a:Marostegui
[09:51:31] (PS1) Marostegui: db-eqiad.php: Increase weight for db1078 and db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474656 (https://phabricator.wikimedia.org/T209754)
[09:52:18] (CR) Alexandros Kosiaris: [C: -2] "> here is what happens if i update to the upstream release branch: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474334/ unless t" [puppet] - https://gerrit.wikimedia.org/r/472363 (owner: Dzahn)
[09:53:18] (PS4) Pmiazga: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: Niedzielski)
[09:55:10] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1078 and db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474656 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:55:48] (PS1) Marostegui: Revert "install_server: Allow re-image db1078" [puppet] - https://gerrit.wikimedia.org/r/474657
[09:55:54] (PS2) Marostegui: Revert "install_server: Allow re-image db1078" [puppet] - https://gerrit.wikimedia.org/r/474657
[09:56:12] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1078 and db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/474656 (https://phabricator.wikimedia.org/T209754) (owner: Marostegui)
[09:56:41] (CR) Marostegui: [C: 2] Revert "install_server: Allow re-image db1078" [puppet] - https://gerrit.wikimedia.org/r/474657 (owner: Marostegui)
[09:57:09] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1082.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1082....
[09:57:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1123 and increase weight for db1078 T209754 (duration: 00m 46s)
[09:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:19] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754
[10:02:29] (PS1) Marostegui: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/474658
[10:02:30] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1082.eqiad.wmnet'...
[10:03:10] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1078 and db1123 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474656 (https://phabricator.wikimedia.org/T209754) (owner: 10Marostegui) [10:08:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474658 (owner: 10Marostegui) [10:09:14] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474658 (owner: 10Marostegui) [10:11:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s) [10:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:07] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 [10:11:14] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [10:11:39] (03PS2) 10Ema: Revert "ATS: temporarily avoid calling 'verify_config' in ExecReload" [puppet] - 10https://gerrit.wikimedia.org/r/474283 [10:13:26] (03CR) 10Ema: [C: 032] Revert "ATS: temporarily avoid calling 'verify_config' in ExecReload" [puppet] - 10https://gerrit.wikimedia.org/r/474283 (owner: 10Ema) [10:14:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So this changes > 500 files per gerrit and https://github.com/puppetlabs/puppetlabs-stdlib/compare/4.15.0...release" [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [10:16:55] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474658 (owner: 10Marostegui) [10:17:37] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474662 [10:19:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474662 (owner: 10Marostegui) [10:20:38] (03Merged) 
10jenkins-bot: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474662 (owner: 10Marostegui) [10:21:05] !log stopping replication on db2076 (T85757) [10:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:09] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:21:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1078 T209754 (duration: 00m 46s) [10:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:36] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 [10:22:33] RECOVERY - Host lvs2010 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [10:25:27] awight Krenair: The "not suitable for long-term storage" comment in https://meta.wikimedia.org/wiki/Etherpad still stands. It's also depicted in https://github.com/wikimedia/puppet/blob/production/modules/etherpad/templates/settings.json.erb#L16 which is the text displayed (that nobody reads, same as a EULA) when creating a new pad. [10:25:34] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1082.eqiad.wmnet'] ` and were **ALL** successful. [10:25:57] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [10:27:32] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1083.eqiad.wmnet'... [10:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T1030). [10:30:24] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:30:34] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474663 (https://phabricator.wikimedia.org/T209754) [10:31:04] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474662 (owner: 10Marostegui) [10:33:32] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 3 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220 (10mobrovac) 05Open>03Resolved Indeed @jijiki ! Thanks! [10:34:34] (03CR) 10Gehel: "Minor comments inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [10:35:44] (03PS4) 10Gehel: maps: added use_proxy flag to set proxy [puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [10:36:55] PROBLEM - pybal on lvs2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [10:37:08] (03CR) 10Gehel: [C: 032] maps: added use_proxy flag to set proxy [puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [10:38:32] (03PS1) 10MSantos: Disable admin cron on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/474667 [10:38:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474663 (https://phabricator.wikimedia.org/T209754) (owner: 10Marostegui) [10:39:45] (03CR) 10jerkins-bot: [V: 04-1] Disable admin cron on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/474667 (owner: 10MSantos) [10:40:25] (03Merged) 10jenkins-bot: db-eqiad.php: Fully 
repool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474663 (https://phabricator.wikimedia.org/T209754) (owner: 10Marostegui) [10:41:20] (03PS2) 10MSantos: Disable admin cron on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/474667 [10:41:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1078 T209754 (duration: 00m 46s) [10:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:54] T209754: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 [10:44:33] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) @Cmjohnson the Debian OS install is in progress, but I think that an-worker109[45] have their network ports disabled. Can you... [10:44:38] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474663 (https://phabricator.wikimedia.org/T209754) (owner: 10Marostegui) [10:45:30] (03PS4) 10Gehel: maps: update SQL script location for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/473736 (https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [10:45:57] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10fselles) i took a quick look with strace and it seems is hanging launching this ` /bin/bash /srv/mediawiki/php-1.33.0-wmf.4/includes/shell/limit.sh 'gpg' --h... [10:46:20] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Joe) I just succesfully obtained an encrypted message by running the same script and disabling the light_process feature of HHVM. ` $ PHP="/usr/bin/hhvm -d hh... 
[10:47:52] 10Operations, 10DBA, 10Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (10Marostegui) 05Open>03Resolved db1078 is now fully repooled after cloning it. This is all done. As a follow up with DCOps I have created {T209815} so we can have everything up to da... [10:49:53] (03PS2) 10Ema: ATS: add check_trafficserver_verify_config [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) [10:50:02] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:50:41] (03CR) 10MSantos: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/473736 (https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [10:50:52] (03CR) 10Gehel: "LGTM (minor comment inline, but for a future CR)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474667 (owner: 10MSantos) [10:51:08] (03CR) 10Gehel: [C: 032] Disable admin cron on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/474667 (owner: 10MSantos) [10:54:57] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Joe) So, redefining `$wgSecurePollTempDir` doesn't do the trick: ` $ mwscript eval.php votewiki > $wgSecurePollTempDir="/var/tmp/hhvm"; > $context = new Sec... 
[10:56:12] (03PS1) 10Elukey: Apply -R 200 to memcached on mc1020 [puppet] - 10https://gerrit.wikimedia.org/r/474670 (https://phabricator.wikimedia.org/T208844) [10:56:42] (03PS5) 10Gehel: maps: update SQL script location for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/473736 (https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [10:58:14] (03CR) 10Gehel: [C: 032] maps: update SQL script location for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/473736 (https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [10:59:10] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:00:34] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1084.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1084.... [11:05:53] (03PS3) 10Ema: ATS: add check_trafficserver_verify_config [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) [11:07:27] (03CR) 10Arturo Borrero Gonzalez: "Comments inline. Thanks for working on this!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [11:08:54] (03PS1) 10Muehlenhoff: Enable Kerberos for Druid workers (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474672 [11:08:56] (03PS1) 10Muehlenhoff: Enable Kerberos for Druid/www (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474673 [11:11:04] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [11:15:32] (03PS2) 10Elukey: Apply -R 200 to memcached on mc1020 [puppet] - 10https://gerrit.wikimedia.org/r/474670 (https://phabricator.wikimedia.org/T208844) [11:15:38] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13572/" [puppet] - 10https://gerrit.wikimedia.org/r/474670 (https://phabricator.wikimedia.org/T208844) (owner: 10Elukey) [11:20:14] !log restart memcached on mc1020 to apply -R 200 settings (shard wiped) - T208844 [11:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:18] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [11:21:04] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) [11:22:08] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Joe) I tried to debug further what the problem is, given I'm not inclined to disable site-wide the use of lightprocesses (although now it should be less of an... 
[11:23:03] (03PS1) 10Muehlenhoff: Fix help text [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/474674 [11:24:47] (03PS3) 10Alexandros Kosiaris: First draft of a zotero helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/466287 (https://phabricator.wikimedia.org/T201611) [11:28:38] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1085.eqiad.wmnet'... [11:30:16] (03PS5) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [11:33:45] !log labsdb1011 upgraded packages on labsdb1011 (pre-work T209517) [11:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:48] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [11:39:57] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:40:04] (03CR) 10Vgutierrez: "pcc happy with the new approach, showing NOOPs across several lvs nodes and the expected changes in lvs2010: https://puppet-compiler.wmfla" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [11:41:03] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [11:41:45] RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: No SCREEN or tmux processes detected. [11:50:14] PROBLEM - Long running screen/tmux on certcentral1001 is CRITICAL: CRIT: Long running SCREEN process. (user: vgutierrez PID: 13547, 1735234s 1728000s). 
[11:50:39] that snitch ¬¬ :) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T1200). [12:00:04] Zoranzoki21, Urbanecm, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] here [12:00:14] o/ [12:01:51] (03PS12) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [12:02:59] I can SWAT today [12:03:10] (03CR) 10GTirloni: toolforge: Refactor clush (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [12:03:23] raynor: go ahead with your commit while I review other commits [12:04:34] kk [12:04:58] (03CR) 10Pmiazga: [C: 032] Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:06:33] Zoranzoki21 around for SWAT? 
[12:06:42] (03Merged) 10jenkins-bot: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:07:20] Urbanecm please stand by, looks like Zoranzoki21 is not around, so you're next [12:07:32] (03CR) 10jenkins-bot: Prod: increase Schema.org page split test to 50% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473225 (https://phabricator.wikimedia.org/T208755) (owner: 10Niedzielski) [12:07:48] in case Zoranzoki21 is not around, in that case I can take care about his patches [12:08:34] testing on mwdebug1002... [12:09:11] Urbanecm: ok, do you want me to deploy your change first, or his? [12:09:50] mine, it is more urgent [12:10:07] deploying... [12:10:18] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:10:45] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:473225]|Enable Schema.org page split test at 50% sampling (T208755)]] (duration: 00m 46s) [12:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:50] T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755 [12:11:24] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [12:11:33] zeljkof, I'm done - my patch works. Thank you for letting me go first. [12:11:53] SWAT window is yours zeljkof -> if you need help with swatting other changes just let me know [12:12:13] Hi, I am here for SWAT [12:12:19] raynor: well, if you need training, feel free to deploy all changes ;) [12:12:45] it's not that I need training, but if you're busy I can do that [12:12:54] ok, I'll do that zeljkof, but please be around [12:12:56] ;) [12:13:07] raynor: ok, go ahead :) we're all busy [12:13:12] Urbanecm, ready? 
I can go with your patch [12:13:15] I'm around in case you need help [12:13:59] yup, feel free to push to mwdebug raynor [12:14:01] Zoranzoki21, please wait [12:14:06] raynor: Ok [12:14:27] (03PS2) 10Pmiazga: Remove wgMetaNamespaceTalk for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474124 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [12:14:51] (03CR) 10Pmiazga: [C: 032] Remove wgMetaNamespaceTalk for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474124 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [12:15:17] Urbanecm, merging, I'll let you know once it's on mwdebug1002 [12:15:37] k [12:16:14] raynor: IRC kicked me, I no know why. But, I am back [12:16:24] (03Merged) 10jenkins-bot: Remove wgMetaNamespaceTalk for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474124 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [12:16:35] Zoranzoki21, roger that [12:18:14] Urbanecm, your change is on mwdebug1002 [12:18:18] testing [12:18:59] Zoranzoki21 - is there any particular order in your patches?, can I go one by one from top? [12:19:08] From up to down [12:19:19] raynor, please deploy the patch [12:19:29] to be sure, please run namespaceDupes.php after deploying [12:20:17] Urbanecm, roger that [12:21:27] zeljkof, can you run the script please? [12:21:46] I'm deploying the change right now [12:21:49] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:474124]|Remove wgMetaNamespaceTalk for shnwiki (T206777)]] (duration: 00m 46s) [12:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:53] T206777: Create Wikipedia Shan - https://phabricator.wikimedia.org/T206777 [12:22:07] raynor: even better, you can run the script! 
:) [12:22:13] Urbanecm - your change is live [12:22:16] thanks [12:22:19] raynor: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_scripts [12:22:20] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Joe) Some more information before I get away on a vacation day: looking at strace of HHVM's execution that hangs, I see the gpg-agent process reading what fol... [12:22:47] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/474674 (owner: 10Muehlenhoff) [12:22:52] raynor: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes [12:23:25] ok [12:23:27] raynor: Note, rebase changes before deploying [12:23:35] *rebase my changes [12:23:54] Zoranzoki21, raynor: gerrit does rebases in some cases, so it's not needed always [12:24:17] hashar might know more, but some trivial rebases are done automatically [12:24:20] zeljkof: thx for info [12:24:43] some commits require manual click on rebase in gerrit [12:24:46] ok, so I run the mwscript namespaceDupes.php shnwiki and it told me "0 pages to fix, 0 were resolvable' [12:24:53] and very few need an actual rebase on a dev machine [12:25:05] raynor: did you run with --fix [12:25:08] ? 
[12:25:13] nope, just dry run [12:25:33] run with --fix, I'm not sure if it's needed, but that's what I do :) [12:25:46] and then copy/paste script output to the task as a comment [12:25:51] with fix I have the same output [12:26:11] sometimes there's an action needed, some cleanup [12:26:34] that's done by task owner, not deployer, like renaming pages on a wiki and such [12:26:56] kk, thx [12:27:17] Zoranzoki21 - your turn [12:27:28] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/472918/ [12:27:31] raynor: Let's go in this new adventure ;) [12:27:42] I'm merging and pushing - https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/472918/ to mwdeploy1002, I'll let you know once it's there [12:27:53] raynor: Ok :) [12:28:13] (03CR) 10Pmiazga: [C: 032] Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472918 (https://phabricator.wikimedia.org/T209252) (owner: 10Zoranzoki21) [12:28:56] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1086.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1086.... [12:29:27] Zoranzoki21, I need to merge `[config] 472917 Enable RCPatrol on srwikibooks` first [12:29:42] raynor: Why? [12:29:58] Something bad happening? 
[12:30:07] `Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary` depends on the `Enable RCPatrol` [12:30:32] (03PS2) 10Zoranzoki21: Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472918 (https://phabricator.wikimedia.org/T209252) [12:30:34] and `Enable RCPatrol` depends on `03d551a05209702ccd2de321e7bb58a3c9972b44` which is `Upload HDLogos` [12:30:45] Try now to merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/472918/ [12:32:44] ok, I see [12:32:56] so now it means we don't need the `Enable RCPatrol` patch [12:34:36] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10tstarling) Installing the package gnupg1 and using ` $wgSecurePollGPGCommand = 'gpg1'; ` causes this to be fixed, I have tested it on mwmaint1002. Apparently... [12:35:11] Zoranzoki21 - your change is on mwdebug1002 [12:35:51] Let me check [12:36:29] I will move other patches in another SWAT, because I have to hurry up in school [12:36:49] srwiktionary looks good, raynor [12:36:55] np, I think you can remove the RCPatrol one, as it's already there [12:37:00] ok, deploying to prod [12:37:19] Zoranzoki21, do you want me to deploy another one or do you want to wait for another SWAT window? [12:37:22] (03CR) 10jenkins-bot: Remove wgMetaNamespaceTalk for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474124 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [12:37:24] (03CR) 10jenkins-bot: Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472918 (https://phabricator.wikimedia.org/T209252) (owner: 10Zoranzoki21) [12:37:59] raynor: I will wait for another SWAT window [12:38:13] ok, sorry for that Zoranzoki21 [12:38:25] raynor: No problems. 
I have to go in school [12:38:39] raynor: it`s reason [12:38:45] your change should be in live in less than a minute [12:38:50] ok [12:38:51] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:472918]|Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary (T209252)]] (duration: 00m 46s) [12:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:54] T209252: Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary - https://phabricator.wikimedia.org/T209252 [12:39:02] ok, Zoranzoki21 - it's on prod, please verify [12:39:02] tnx [12:39:02] cya [12:39:08] yes, it`s live [12:39:58] so it means we're done [12:40:01] raynor: done? it's the first swat you did completely? [12:40:07] congratulations! ;) [12:40:10] (03PS1) 10Tim Starling: In SecurePoll use gpg1 to avoid gpg-agent autostart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474679 (https://phabricator.wikimedia.org/T209802) [12:40:13] !log EU SWAT finished [12:40:14] my job is done here [12:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:14] is there more room in the window? [12:40:15] because [12:40:22] see tim's patch right above [12:40:23] * zeljkof rides into the sunset [12:40:25] this is an ubn [12:40:26] hmm [12:40:32] apergos, sure, lets do that [12:40:38] let's make sure the patch is ready [12:40:50] zeljkof - do I log EU SWAT repopened ?? 
:) [12:40:59] raynor: yes [12:41:01] and yes, that was my first SWAT [12:41:03] that's what I do [12:41:15] apergos - sure, I'm around, ping me once you have that patch [12:41:19] thx [12:41:25] and that's how swats go, you think you're done, but there's another patch :D [12:41:44] well this is unusual :-) [12:44:43] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10ema) p:05Triage>03Normal [12:45:13] raynor, apergos: I've just installed gpg1 on all app servers, so that patch is good to go (proper puppet patch to follow) [12:45:19] awesome [12:45:19] (03PS1) 10Mathew.onipe: maps: change nodes.bin owner to osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) [12:45:34] zeljkof, but please, don't add me to deployer list in wikitech ;) [12:45:45] * raynor hides in bushes [12:45:57] raynor: I won't. yet. ;P [12:46:09] moritzm, apergos I'm around, so this is a puppet patch [12:46:14] banyek|away: hey, we scheduled some labsdb reboots in ~15mins for now. Will you be around? 
[12:46:28] marostegui: ^^^ [12:46:38] raynor: no, it's a mediawiki-config patch [12:47:06] there will be a separate you-don't-have-to-worry-about-it puppet patch to make the install of the package be done for future hosts [12:47:12] ah, ok, even better, I didn't do puppet-related deployments yet [12:47:14] ack, I've installed gnupg 1 manually which is a pre-requisite for https://gerrit.wikimedia.org/r/474679 (and that installation will be puppetised) [12:47:16] it was done the old manual quick way so this can get out the door [12:48:31] (03PS3) 10Arturo Borrero Gonzalez: toolforge: purge jmail script [puppet] - 10https://gerrit.wikimedia.org/r/474311 (https://phabricator.wikimedia.org/T208579) (owner: 10BryanDavis) [12:48:35] (03PS1) 10Muehlenhoff: Install gpg 1 on app servers for SecurePoll extension [puppet] - 10https://gerrit.wikimedia.org/r/474681 (https://phabricator.wikimedia.org/T209802) [12:49:26] apergos, moritzm -> please add the patch you want deploy to the https://wikitech.wikimedia.org/wiki/Deployments [12:49:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: purge jmail script [puppet] - 10https://gerrit.wikimedia.org/r/474311 (https://phabricator.wikimedia.org/T208579) (owner: 10BryanDavis) [12:50:12] raynor: ack, doing that now [12:50:46] !log EU SWAT reopened [12:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:03] woops, I was in there [12:51:08] ok well, it's all yours then :-D [12:51:59] zeljkof - are there any jouncebot commands I need to run when someone adds something to the window during deployment? [12:52:19] I know it's used to poke people during SWAT windows, anything else? 
I'm too lazy to check the bot code [12:52:29] raynor: done [12:52:31] raynor: no, I usually just ask them to add the new commit to the calendar [12:52:35] apergos: ack, sorry for the race [12:52:48] no worries [12:53:17] (03CR) 10Mathew.onipe: maps: change nodes.bin owner to osmupdater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) (owner: 10Mathew.onipe) [12:54:15] moritzm, pushing to debug1002 -> I'll let you know once it's done [12:54:21] (03CR) 10Pmiazga: [C: 032] In SecurePoll use gpg1 to avoid gpg-agent autostart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474679 (https://phabricator.wikimedia.org/T209802) (owner: 10Tim Starling) [12:55:24] (03Merged) 10jenkins-bot: In SecurePoll use gpg1 to avoid gpg-agent autostart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474679 (https://phabricator.wikimedia.org/T209802) (owner: 10Tim Starling) [12:56:44] moritzm, apergos -> it's on mwdebug1002 [12:56:44] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1087.eqiad.wmnet'... [13:00:21] I've done the test that bawolff did in https://phabricator.wikimedia.org/T209802#4756963 and it seems ok now [13:00:30] moritzm: anything else we should try do you think? [13:00:44] *done on mwdebug1002 I should say [13:00:57] yeah, looks good to me [13:01:03] raynor: feel free to proceed! 
[13:01:24] roger that [13:01:28] (03PS1) 10Filippo Giunchedi: profile: introduce jmx_exporter_port to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/474683 (https://phabricator.wikimedia.org/T206454) [13:01:29] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1087.eqiad.wmnet'] ` and were **ALL** successful. [13:02:11] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/13576/" [puppet] - 10https://gerrit.wikimedia.org/r/474681 (https://phabricator.wikimedia.org/T209802) (owner: 10Muehlenhoff) [13:03:09] !log pmiazga@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:474679]|In SecurePoll use gpg1 to avoid gpg-agent autostart (T209802)]] (duration: 00m 48s) [13:03:12] moritzm, apergos: deployed to prod [13:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:14] T209802: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 [13:03:20] great [13:03:25] thanks [13:03:47] is there anything important I should see in logs? 
[13:04:18] just asking because that looked like a pretty big change :) [13:04:22] (03CR) 10ArielGlenn: [C: 031] Install gpg 1 on app servers for SecurePoll extension [puppet] - 10https://gerrit.wikimedia.org/r/474681 (https://phabricator.wikimedia.org/T209802) (owner: 10Muehlenhoff) [13:05:11] what we should not see in logs is issues with voting; I think the particular poll is being (re)opened tomorrow [13:06:04] !log EU SWAT finished [13:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:20] "message": "AH01067: Failed to read FastCGI header", this with referrer from securepoll, should not appear any more [13:06:22] (03PS2) 10Muehlenhoff: Install gpg 1 on app servers for SecurePoll extension [puppet] - 10https://gerrit.wikimedia.org/r/474681 (https://phabricator.wikimedia.org/T209802) [13:07:08] (03CR) 10jenkins-bot: In SecurePoll use gpg1 to avoid gpg-agent autostart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474679 (https://phabricator.wikimedia.org/T209802) (owner: 10Tim Starling) [13:07:46] "Worth noting here - coordinators agreed to push voting back on the elections by 24 hours (i.e. 00:00 UTC on November 20)." 
(phab) so we won't see anything til then [13:08:08] !log T207377 icinga downtime and reboot of cloudcontrol1003 and cloudservices1003 [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:11] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [13:09:16] * raynor tries to catch zeljkof so they can ride into the sunset together [13:09:46] (03CR) 10Muehlenhoff: [C: 032] Install gpg 1 on app servers for SecurePoll extension [puppet] - 10https://gerrit.wikimedia.org/r/474681 (https://phabricator.wikimedia.org/T209802) (owner: 10Muehlenhoff) [13:09:54] raynor: I'll halt my horse next to the saloon ;) [13:10:12] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1088.eqiad.wmnet'... [13:11:17] (03PS2) 10Filippo Giunchedi: profile: introduce jmx_exporter_port to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/474683 (https://phabricator.wikimedia.org/T206454) [13:11:42] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:12:32] (03PS2) 10Mathew.onipe: maps: change nodes.bin owner to osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) [13:12:48] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [13:13:03] (03CR) 10Mathew.onipe: maps: change nodes.bin owner to osmupdater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) (owner: 10Mathew.onipe) [13:13:24] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [13:13:57] (03CR) 10Filippo Giunchedi: [C: 032] "PCC 
https://puppet-compiler.wmflabs.org/compiler1002/13578/logstash1008.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/474683 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:14:05] (03PS3) 10Filippo Giunchedi: profile: introduce jmx_exporter_port to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/474683 (https://phabricator.wikimedia.org/T206454) [13:16:18] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Patch-For-Review, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10MoritzMuehlenhoff) p:05Unbreak!>03Normal >>! In T209802#4757707, @tstarling wrote: > Installing the package gnupg1 and using > > ` >... [13:17:01] (03PS1) 10Alexandros Kosiaris: releases: Set Cache-control on charts/index.yaml [puppet] - 10https://gerrit.wikimedia.org/r/474684 [13:17:11] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T209829 (10ops-monitoring-bot) [13:21:13] !log T207377 icinga downtime and reboot of labcontrol1001 and labservices1001 [13:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:23] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [13:25:07] (03CR) 10Alexandros Kosiaris: [C: 032] releases: Set Cache-control on charts/index.yaml [puppet] - 10https://gerrit.wikimedia.org/r/474684 (owner: 10Alexandros Kosiaris) [13:25:10] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [13:30:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] First draft of a zotero helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/466287 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [13:33:34] !log fdans@deploy1001 Started deploy [analytics/aqs/deploy@7cde8c8]: Deploying AQS to add two new fields to uniques [13:33:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:49] (03PS3) 10Alexandros Kosiaris: ores: Move all of celery configs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [13:33:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: Move all of celery configs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [13:33:57] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [13:33:59] (03PS13) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [13:36:26] !log disable puppet on ores1* and ores2* for slow deployment of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474158/ [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:36] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Patch-For-Review, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Cyberpower678) I just submitted a vote to poll 750 successfully. 
[13:39:15] !log cumin -b1 -s 300 'ores2*' 'enable-puppet "merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474158/" ; puppet agent -t ; service uwsgi-ores restart ; service celery-ores-worker restart' [13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:52] !log fdans@deploy1001 Finished deploy [analytics/aqs/deploy@7cde8c8]: Deploying AQS to add two new fields to uniques (duration: 06m 18s) [13:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:08] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@7cde8c8]: Update unique-devices schema adding 2 fields [13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:12] 10Operations, 10Multimedia, 10Traffic: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10ema) [13:43:47] !log installing chromium security update on proton* (tested new upstream release in deployment-prep) [13:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:12] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1089.... [13:44:21] (03CR) 10Vgutierrez: [C: 031] "shellcheck, human and pcc (https://puppet-compiler.wmflabs.org/compiler1002/13580/) are happy. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [13:45:23] akosiaris: always use run-puppet-agent with cumin, never puppet agent -t, has the wrong exit codes. 
Pro-tip use aliases ;) like A:ores-codfw [13:45:31] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1084.eqiad.wmnet'... [13:46:24] 10Operations, 10Traffic: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (10BBlack) When looking at the latest MaxMind data, it locates this network as being in New Zealand, which we map to ulsfo as first choice, and esams as the last-resort choice.... [13:47:29] (03PS4) 10Ema: ATS: add check_trafficserver_verify_config [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) [13:48:46] (03CR) 10Ema: [C: 032] ATS: add check_trafficserver_verify_config [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [13:53:03] (03CR) 10Arturo Borrero Gonzalez: toolforge: Refactor clush (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [13:54:27] volans: I was wondering whether you will comment. I was betting on me using the shell notation ; and not the chained commands [13:54:46] turns out I lost the bet [13:54:51] but just barely [13:55:19] oh yeah, that too, -m async, command1 command2.... 
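Note on the "wrong exit codes" remark above: `puppet agent -t` (`--test`) implicitly enables `--detailed-exitcodes`, so a run that applied changes exits with 2 rather than 0, and a tool that treats any non-zero status as a failure will report a healthy host as failed. A minimal sketch of that mapping follows; the helper names are hypothetical and this is not how cumin or run-puppet-agent are actually implemented.

```python
# Exit codes documented for `puppet agent --detailed-exitcodes`
# (which --test / -t turns on implicitly).
DETAILED_EXIT_CODES = {
    0: "no changes, no failures",
    1: "run failed or was not attempted",
    2: "changes applied",
    4: "some resources failed",
    6: "changes applied and some resources failed",
}


def naive_succeeded(exit_code: int) -> bool:
    """What a generic remote-execution tool assumes: only 0 is success."""
    return exit_code == 0


def puppet_run_succeeded(exit_code: int) -> bool:
    """Hypothetical helper: 0 and 2 both mean the run completed cleanly."""
    return exit_code in (0, 2)


if __name__ == "__main__":
    for code, meaning in DETAILED_EXIT_CODES.items():
        print(code, meaning, naive_succeeded(code), puppet_run_succeeded(code))
```

The same reasoning applies to chaining commands with `;` in a shell string: only the status of the last command is reported, so an earlier failure can be masked.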
[13:55:23] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [13:55:31] I know about the wrong exit codes btw, hence the non-chained commands [13:55:45] my muscle memory always goes there [13:55:53] !log T207377 reboot cloudcontrol1004 [13:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [13:56:27] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [13:57:53] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:58:40] (03PS14) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [14:00:12] (03CR) 10GTirloni: toolforge: Refactor clush (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [14:02:01] (03PS15) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [14:02:05] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@7cde8c8]: Update unique-devices schema adding 2 fields (duration: 20m 57s) [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:23] PROBLEM - AQS root url on aqs1004 is CRITICAL: connect to address 10.64.0.107 and port 7232: Connection refused [14:02:29] elukey: --^ [14:02:32] :( [14:03:18] it is depooled so it is not making too much damage :D [14:04:21] joal: the rest of AQS looks good right? 
[14:06:45] 10Operations, 10Multimedia, 10Traffic: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10TheDJ) This is due to experiments with {T27611} [14:08:34] correct elukey [14:08:39] sorry for delay [14:11:27] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [14:12:55] 10Operations, 10Multimedia, 10Traffic: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10TheDJ) Opera 11.60 release in 2011-12-06 (.64 are just security updates). I guess in theory we could blacklist the old Opera UAs in the varnish confi... [14:16:53] (03CR) 10Gehel: [C: 04-1] maps: change nodes.bin owner to osmupdater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) (owner: 10Mathew.onipe) [14:17:51] (03PS10) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:19:10] (03CR) 10Gehel: [C: 032] elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [14:19:51] (03PS1) 10Ema: Avoid serving WebP thumbnail variants to Opera [puppet] - 10https://gerrit.wikimedia.org/r/474693 (https://phabricator.wikimedia.org/T27611) [14:20:24] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10ema) p:05Triage>03Normal [14:21:25] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10faidon) JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone else set up that? 
More broadly, the concept... [14:22:44] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) >>! In T171188#4758194, @faidon wrote: > JFTR, I don't know what cloudinfra-puppetmaster-01 is. Maybe @Krenair or someone els... [14:25:42] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10TheDJ) Note: The chromium/webkit versions of Opera after opera 15 use the OPR string to identify Opera. These browsers likely D... [14:27:59] PROBLEM - Ensure traffic_server is running on cp2003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server [14:31:59] RECOVERY - Ensure traffic_server is running on cp2003 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server [14:32:03] (03PS1) 10Ladsgroup: ores: Change configs to celery4 ones [puppet] - 10https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) [14:32:44] (03CR) 10Ladsgroup: [C: 04-1] "Merging this would make the whole cluster to explode. 
I need to deploy several stuff first" [puppet] - 10https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [14:33:13] (03PS1) 10Muehlenhoff: Disable Diamond on WDQS hosts [puppet] - 10https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454) [14:39:25] (03PS3) 10Niedzielski: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) [14:41:47] (03PS2) 10Niedzielski: BC Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) [14:42:06] (03PS7) 10Niedzielski: Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) [14:44:33] (03PS1) 10Muehlenhoff: Absent unused Diamond collector for ldap/corp [puppet] - 10https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454) [14:44:36] (03PS1) 10Muehlenhoff: Remove Diamond from openldap/corp servers [puppet] - 10https://gerrit.wikimedia.org/r/474699 (https://phabricator.wikimedia.org/T183454) [15:01:13] PROBLEM - Ensure traffic_server is running on cp1073 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server [15:02:13] RECOVERY - Ensure traffic_server is running on cp1073 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server [15:05:34] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-old on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474704 (https://phabricator.wikimedia.org/T188327) [15:05:49] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10herron) >>! In T171188#4758203, @Krenair wrote: >>>! 
In T171188#4758194, @faidon wrote: >> JFTR, I don't know what cloudinfra-puppetma... [15:06:09] (03CR) 10Anomie: [C: 032] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474704 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:07:13] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474704 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:08:29] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on group 0 (T188327) (duration: 00m 47s) [15:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:32] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [15:10:55] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10Gilles) I guess this means that these older Opera versions send request headers stating that they accept webp when they're in f... [15:12:41] PROBLEM - Ensure traffic_server is running on cp1072 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server [15:12:51] known, sorry about that ^ [15:13:03] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-old on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474704 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [15:13:04] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10TheDJ) @gilles see my note in T209805#4758174 v11 probably supports some early versions of them, but not all. 
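Note on the WebP discussion above: the idea is to stop trusting the `Accept` header when the User-Agent is an old (Presto-era) Opera, while leaving Chromium-based Opera, which identifies itself with `OPR`, untouched. The sketch below illustrates that decision in Python; the function name, the regular expression and the sample User-Agent strings are assumptions for illustration, not the contents of the actual Varnish change.

```python
import re

# Assumption: legacy Opera carries a literal "Opera" token in its User-Agent,
# while Chromium-based Opera (15 and later) uses "OPR/" instead.
_LEGACY_OPERA = re.compile(r"\bOpera\b")


def should_serve_webp(accept_header: str, user_agent: str) -> bool:
    """Return True only if WebP is advertised and the UA is not legacy Opera."""
    if "image/webp" not in accept_header:
        return False
    if _LEGACY_OPERA.search(user_agent):
        # Advertises WebP but reportedly cannot render the generated thumbnails.
        return False
    return True


# Illustrative User-Agent strings, not taken from the log.
LEGACY_UA = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.10.229 Version/11.64"
MODERN_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.106"
)

if __name__ == "__main__":
    print(should_serve_webp("image/webp,*/*", LEGACY_UA))
    print(should_serve_webp("image/webp,*/*", MODERN_UA))
```

Matching on the User-Agent is a blunt instrument, but it keeps the exception narrow: browsers that do not advertise `image/webp` are unaffected either way.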
[15:13:41] RECOVERY - Ensure traffic_server is running on cp1072 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server [15:13:42] (03PS1) 10Ema: ATS: more specific traffic_server check definition [puppet] - 10https://gerrit.wikimedia.org/r/474706 (https://phabricator.wikimedia.org/T204209) [15:17:05] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) As arturo suggests, cloudinfra-puppetmaster-01 is meant to be the puppetmaster for things inside the cloudinfra project. I an... [15:17:51] (03CR) 10Ema: [C: 032] ATS: more specific traffic_server check definition [puppet] - 10https://gerrit.wikimedia.org/r/474706 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [15:18:14] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@b399c34]: Removing empty fields from unique result [15:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:23] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1089.eqiad.wmnet'... [15:18:26] (03CR) 10Gilles: [C: 031] Avoid serving WebP thumbnail variants to Opera [puppet] - 10https://gerrit.wikimedia.org/r/474693 (https://phabricator.wikimedia.org/T27611) (owner: 10Ema) [15:18:28] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10Gilles) Indeed. I've installed 11.64 and even the lossy ones we generate don't work. And it does advertise webp support in requ... 
[15:18:32] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1089.... [15:19:20] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10herron) [15:19:24] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) 05Open>03Resolved a:03herron Logs have been successfully s... [15:21:36] (03PS2) 10Ema: Avoid serving WebP thumbnail variants to Opera [puppet] - 10https://gerrit.wikimedia.org/r/474693 (https://phabricator.wikimedia.org/T27611) [15:22:24] (03CR) 10Ema: [C: 032] Avoid serving WebP thumbnail variants to Opera [puppet] - 10https://gerrit.wikimedia.org/r/474693 (https://phabricator.wikimedia.org/T27611) (owner: 10Ema) [15:22:30] 10Operations, 10Multimedia, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10Gilles) I've just verified the current stable Opera out of curiosity and it does (unsurprisingly) render our webps correctly. 
[15:22:57] 10Operations, 10Multimedia, 10Performance-Team, 10Traffic, 10Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (10Gilles) a:03ema [15:23:31] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@b399c34]: Removing empty fields from unique result (duration: 05m 17s) [15:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:53] akosiaris: tell me when you're around to do the deployment of the ores [15:26:09] (if the puppet changes reached eqiad) [15:26:33] yeah they are done now [15:26:35] Amir1: ^ [15:26:44] feel free to start the deployment [15:26:55] okay, I start the deployment [15:28:17] dba11e9640642e8e5bc93a82c5be39916990e0c6 <- In case we need to rollback ores [15:28:24] !log ladsgroup@deploy1001 Started deploy [ores/deploy@e957b24]: T209587 T170950 [15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:29] T209587: Migrate ores celery configs to celery 4 - https://phabricator.wikimedia.org/T209587 [15:28:30] T170950: ORES deployment finish "successfully" even when uwsgi and celery fail to successfully start up - https://phabricator.wikimedia.org/T170950 [15:29:02] (03PS1) 10Ema: ATS: quote traffic_server check_procs arguments [puppet] - 10https://gerrit.wikimedia.org/r/474709 (https://phabricator.wikimedia.org/T204209) [15:29:44] (03PS1) 10Alexandros Kosiaris: icinga: Re-add reload functionality [puppet] - 10https://gerrit.wikimedia.org/r/474710 [15:32:46] (03CR) 10Alexandros Kosiaris: "PCC https://puppet-compiler.wmflabs.org/compiler1002/13582/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [15:33:59] (03PS2) 10Alexandros Kosiaris: icinga: Re-add reload functionality [puppet] - 10https://gerrit.wikimedia.org/r/474710 [15:34:10] works fine. 
Moving forward [15:34:25] (Logs are clean, curl works, grafana is happy) [15:38:22] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.004 second response time [15:40:51] 10Operations, 10Multimedia, 10Performance-Team, 10Traffic: Wikipedia sends WebP thumbnails when Opera claims to support it but lies - https://phabricator.wikimedia.org/T209805 (10Gilles) 05Open>03Resolved [15:41:12] 10Operations, 10Multimedia, 10Performance-Team, 10Traffic: Wikipedia sends WebP thumbnails when Opera claims to support it but lies - https://phabricator.wikimedia.org/T209805 (10Gilles) Verified the fix on enwiki front page using Opera 11.64 [15:43:25] (03CR) 10GTirloni: "Any other comments?" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [15:45:33] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@e957b24]: T209587 T170950 (duration: 17m 09s) [15:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:37] T209587: Migrate ores celery configs to celery 4 - https://phabricator.wikimedia.org/T209587 [15:45:38] T170950: ORES deployment finish "successfully" even when uwsgi and celery fail to successfully start up - https://phabricator.wikimedia.org/T170950 [15:46:20] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1090.eqiad.wmnet'... 
[15:48:24] (03PS16) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [15:50:05] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1090.eqiad.wmnet'] ` and were **ALL** successful. [15:50:15] RECOVERY - Long running screen/tmux on certcentral1001 is OK: OK: No SCREEN or tmux processes detected. [15:52:01] (03CR) 10Ema: [C: 032] ATS: quote traffic_server check_procs arguments [puppet] - 10https://gerrit.wikimedia.org/r/474709 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [15:52:51] 10Operations, 10Certcentral, 10Traffic: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (10Vgutierrez) p:05Triage>03Normal [15:52:54] 10Operations, 10Certcentral, 10Traffic: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (10Vgutierrez) [15:56:52] (03CR) 10Alexandros Kosiaris: [C: 031] Absent unused Diamond collector for ldap/corp [puppet] - 10https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:57:57] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Jhernandez) Do we want to keep this one open to wait for the stretch migration and checking on the eventbus load, or should we spin u... 
[15:59:35] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [15:59:43] (03PS1) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [15:59:45] (03PS1) 10Vgutierrez: librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) [16:00:50] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi All yours [16:05:00] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/13584/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:06:29] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Ottomata) Ping on this! I know it is TG week so things might be slow, but I'm checking in anyway :) [16:08:13] (03CR) 10Filippo Giunchedi: [C: 031] Disable Diamond on WDQS hosts [puppet] - 10https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:08:18] (03CR) 10Ladsgroup: [C: 031] "The code has been deployed to prod. This patch is cherry-picked on beta and works fine. We can deploy this now. Maybe gradually I guess." 
[puppet] - 10https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [16:08:24] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from openldap/corp servers [puppet] - 10https://gerrit.wikimedia.org/r/474699 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:08:36] (03CR) 10Filippo Giunchedi: [C: 031] Absent unused Diamond collector for ldap/corp [puppet] - 10https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:13:35] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. [16:13:43] (03CR) 10Cwhite: [C: 031] "Seems reasonable to me given the sysvinit script is auto-converted to systemd unit. It should work." [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [16:14:23] (03PS1) 10Volans: Fix an-worker1089 management PTRs [dns] - 10https://gerrit.wikimedia.org/r/474724 (https://phabricator.wikimedia.org/T207192) [16:15:34] (03CR) 10Elukey: [C: 032] "Better call volans CIT" [dns] - 10https://gerrit.wikimedia.org/r/474724 (https://phabricator.wikimedia.org/T207192) (owner: 10Volans) [16:16:00] lol [16:18:08] (03CR) 10Cwhite: [C: 031] Absent unused Diamond collector for ldap/corp [puppet] - 10https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:18:32] (03CR) 10Cwhite: [C: 031] Remove Diamond from openldap/corp servers [puppet] - 10https://gerrit.wikimedia.org/r/474699 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:18:52] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [16:18:56] (03CR) 10Cwhite: [C: 031] Disable Diamond on WDQS hosts [puppet] - 10https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:18:58] RECOVERY - puppet last run on ms-be2046 is OK: OK: Puppet is currently enabled, last run 3 
minutes ago with 0 failures [16:20:31] (03PS2) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [16:20:34] (03PS2) 10Vgutierrez: librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) [16:20:36] (03PS1) 10Vgutierrez: certcentral: Use the same naming schema for certs as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) [16:22:02] (03PS1) 10Filippo Giunchedi: site: add ms-be205* hosts [puppet] - 10https://gerrit.wikimedia.org/r/474732 (https://phabricator.wikimedia.org/T209395) [16:22:24] (03PS2) 10Vgutierrez: certcentral: Use the same naming schema for certs as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) [16:22:26] (03PS3) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [16:22:28] (03PS3) 10Vgutierrez: librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) [16:22:30] (03CR) 10Alex Monk: [C: 04-1] "commit message should probably mention this importantly adds chain.crt files" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:22:48] 10Operations, 10ops-codfw, 10DBA: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (10Marostegui) [16:23:04] 10Operations, 10ops-codfw, 10DBA: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (10Marostegui) p:05Triage>03Normal [16:23:43] (03PS2) 10Cwhite: role: add aggregations for TCP Fast Open to prometheus global 
[puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) [16:23:51] 10Operations, 10ops-codfw, 10DBA: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (10Marostegui) [16:24:17] (03PS3) 10Cwhite: role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) [16:26:01] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10fgiunchedi) Thanks @papaul ! Writing down what I found and the fixes for reference * ms-be2046 doesn't show its spinning, only ssd. fixed with `megacli... [16:26:02] (03PS3) 10Vgutierrez: certcentral: Deliver same certs (with same naming) as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) [16:26:04] (03PS4) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [16:26:06] (03PS4) 10Vgutierrez: librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) [16:26:14] (03CR) 10Filippo Giunchedi: [C: 032] site: add ms-be205* hosts [puppet] - 10https://gerrit.wikimedia.org/r/474732 (https://phabricator.wikimedia.org/T209395) (owner: 10Filippo Giunchedi) [16:26:40] jouncebot: next [16:26:40] In 1 hour(s) and 33 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T1800) [16:26:43] (03CR) 10Vgutierrez: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:27:20] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive 
log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:28:32] 10Operations, 10Operations-Software-Development: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10Volans) My proposal is to start with 1+2, 6 and 8. 1 and 2 can be merged into a single report that validates the ass... [16:28:54] 10Operations, 10ops-codfw, 10DBA: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (10Marostegui) [16:28:59] (03CR) 10Arturo Borrero Gonzalez: "I still have some doubts about the role::wmcs::toolforge::clush::target." [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [16:29:17] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:29:18] (03CR) 10Filippo Giunchedi: [C: 031] role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:29:58] (03PS2) 10Alexandros Kosiaris: ores::redis: Set maxmemory-policy: volatile-lur [puppet] - 10https://gerrit.wikimedia.org/r/474450 (https://phabricator.wikimedia.org/T209628) [16:30:38] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:30:42] (03CR) 10Alexandros Kosiaris: [C: 032] ores::redis: Set maxmemory-policy: volatile-lur [puppet] - 10https://gerrit.wikimedia.org/r/474450 (https://phabricator.wikimedia.org/T209628) (owner: 10Alexandros Kosiaris) [16:31:13] (03CR) 10Alexandros Kosiaris: [C: 032] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/474450 (https://phabricator.wikimedia.org/T209628) (owner: 10Alexandros Kosiaris) [16:31:21] (03CR) 10Vgutierrez: "PCC is still happy https://puppet-compiler.wmflabs.org/compiler1002/13585/netmon1002.wikimedia.org/ and shows the changes introduced by Ie" [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:33:20] 10Operations, 10ops-codfw, 10DBA: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (10Marostegui) [16:33:38] PROBLEM - puppet last run on ms-be2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:59] (03CR) 10Alex Monk: [C: 031] certcentral: Deliver same certs (with same naming) as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:34:28] (03CR) 10Vgutierrez: [C: 032] certcentral: Deliver same certs (with same naming) as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:34:45] (03PS4) 10Vgutierrez: certcentral: Deliver same certs (with same naming) as LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/474730 (https://phabricator.wikimedia.org/T209856) [16:35:13] (03PS1) 10Filippo Giunchedi: Include ms-be[12]05[0-9] hosts in disk/role configuration [puppet] - 10https://gerrit.wikimedia.org/r/474734 [16:36:59] (03CR) 10Filippo Giunchedi: [C: 032] Include ms-be[12]05[0-9] hosts in disk/role configuration [puppet] - 10https://gerrit.wikimedia.org/r/474734 (owner: 10Filippo Giunchedi) [16:37:02] (03CR) 10Cwhite: [C: 032] role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:37:07] (03PS2) 10Filippo Giunchedi: Include ms-be[12]05[0-9] hosts in 
disk/role configuration [puppet] - 10https://gerrit.wikimedia.org/r/474734 [16:37:09] (03PS4) 10Cwhite: role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) [16:37:17] 10Operations, 10Wikimedia-Logstash: Ship peopleweb apache2 logs to ELK - https://phabricator.wikimedia.org/T209860 (10herron) p:05Triage>03Normal [16:37:28] 10Operations, 10Wikimedia-Logstash: Ship peopleweb apache2 logs to ELK - https://phabricator.wikimedia.org/T209860 (10herron) [16:37:30] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:39:47] (03PS5) 10Cwhite: role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) [16:40:20] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:40:29] (03PS5) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [16:42:38] sigh.. and rebase again :) [16:43:02] (03PS6) 10Vgutierrez: certcentral: Provide a TLS certificate for librenms [puppet] - 10https://gerrit.wikimedia.org/r/474722 (https://phabricator.wikimedia.org/T209856) [16:43:54] RECOVERY - puppet last run on ms-be2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:44:48] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:44:54] PROBLEM - Check systemd state on ores2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:45:14] PROBLEM - Check systemd state on ores2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:20] PROBLEM - Check systemd state on ores2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:20] PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:28] PROBLEM - Check systemd state on ores2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:30] PROBLEM - Check systemd state on ores2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:47:49] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [16:47:54] 10Operations, 10Traffic, 10Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) 05Open>03Resolved [16:47:59] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [16:48:08] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [16:48:15] akosiaris: ^^^ [16:48:18] (ORES) [16:48:23] (03CR) 10GTirloni: "> Patch Set 16:" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [16:48:27] (03CR) 10Mathew.onipe: [C: 031] Disable Diamond on WDQS hosts [puppet] - 10https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:48:53] * akosiaris looking [16:48:57] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10Volans) @faidon >>! 
In T205898#4735131, @faidon wrote: > Regardless, I think this all boils down to these two questions: > - Is it worth our time/effort to pursue this NAPALM exploration furthe... [16:49:14] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) [16:50:23] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10Dzahn) >>! In T209757#4756307, @jcrespo wrote: >> i think we should normally not use this method (disable notifications) and instead "schedule... [16:50:30] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) switch port information sessionstore... 
[16:50:45] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [16:53:37] PROBLEM - Disk space on cp1071 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=91%) [16:54:03] PROBLEM - Disk space on cp2009 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%) [16:54:21] RECOVERY - Check systemd state on ores2007 is OK: OK - running: The system is fully operational [16:54:33] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational [16:54:36] (03CR) 10Vgutierrez: [C: 032] librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) (owner: 10Vgutierrez) [16:54:44] (03PS5) 10Vgutierrez: librenms: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474723 (https://phabricator.wikimedia.org/T209856) [16:54:45] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational [16:55:09] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational [16:55:47] ema: cp1071/2009 (traffic server backends afaics) have their root partition filled up [16:56:02] seems /var/log and 10G / maximum space [16:56:03] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational [16:56:04] elukey: looking! [16:56:13] super :) [16:56:47] PROBLEM - Disk space on cp1074 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=91%) [16:57:55] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational [16:59:41] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational [17:00:53] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) >>! In T205898#4735131, @faidon wrote: > [...] 
but I'm wondering what "some devices facts" means exactly Uptime, OS version, serial#, etc... And it only displays it as far as I know. >>... [17:02:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1007 predicted raid failure - https://phabricator.wikimedia.org/T209861 (10Andrew) [17:03:10] PROBLEM - Check systemd state on cp2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:05:24] RECOVERY - Disk space on cp2009 is OK: DISK OK [17:06:12] RECOVERY - Check systemd state on cp2009 is OK: OK - running: The system is fully operational [17:06:18] (03PS1) 10Alex Monk: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) [17:07:30] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) The old ticket was too old, and new ticket 19131684 has been opened. I'm working this (sending over all the old info and logs) and will schedule another onsite attempt. [17:09:12] RECOVERY - Disk space on cp1074 is OK: DISK OK [17:10:48] PROBLEM - Disk space on cp1073 is CRITICAL: DISK CRITICAL - free space: / 209 MB (2% inode=91%) [17:14:10] PROBLEM - Disk space on cp2015 is CRITICAL: DISK CRITICAL - free space: / 260 MB (2% inode=91%) [17:14:30] (03PS2) 10Alex Monk: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) [17:14:30] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. 
Failed resources (up to 3 shown): File[/etc/centralcerts/librenms.rsa-2048.crt],File[/etc/centralcerts/librenms.rsa-2048.chain.crt],File[/etc/centralcerts/librenms.rsa-2048.chained.crt],File[/etc/centralcerts/librenms.rsa-2048.key] [17:14:32] (03PS1) 10Alex Monk: librenms: Remove old letsencrypt puppetisation cert [puppet] - 10https://gerrit.wikimedia.org/r/474747 (https://phabricator.wikimedia.org/T209856) [17:14:48] vgutierrez: you called for it ^^^ [17:15:11] yeah... that's expected [17:15:13] ema: are those space alarms for cp hosts expected? [17:15:41] in the first run in the client gets authorized and in the second one the fetch attempt works [17:15:47] (in the first one gets a 403) [17:16:03] I'm wondering if we can set a require on a @@file resource [17:16:23] in theory we should not have anything that requires multiple puppet runs by design [17:16:37] but I also know that we have already some cases like this one [17:16:47] volans: nope, that's the ats logging part not working as advertised [17:16:50] like icinga checks? 
:) [17:16:56] ema: ack [17:17:10] vgutierrez: no that's one puppet run on the target host and one on the icinga host [17:17:17] that's by puppet design and exported resources [17:17:22] I mean double run on a single host [17:17:47] volans, so what we could do instead is have something running on certcentral hosts that polls puppet DB for authorised hosts entries [17:18:06] I don't know the details of the current issue and also in a meeting [17:18:10] ok [17:18:26] 10Operations, 10ops-eqiad, 10netops: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (10ayounsi) [17:18:28] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [17:18:32] but happy to discuss later/tomorrow with a bit more context ;) [17:19:30] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:19:47] (03CR) 10Ottomata: "Looks good! Woohoo more parameters!" (033 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 (owner: 10Elukey) [17:20:39] 10Operations, 10netops, 10Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10ayounsi) Reply from the RIPE: > I see that you have found the problem as my graphs are looking normal now. From what I can gather, it was packet loss on IPv6 cau... [17:22:46] PROBLEM - Check systemd state on cp1071 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:23:10] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [17:23:12] 10Operations, 10Traffic, 10Patch-For-Review: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (10ema) 05Resolved>03Open There's a problem with fifo-log-demux reading from the pipe, reopening! 
[17:23:30] (03PS1) 10Ema: ATS: disable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/474749 (https://phabricator.wikimedia.org/T204225) [17:23:55] (03CR) 10Awight: [C: 031] "Should be safe (but first deploy to a canary and restart the services!)" [puppet] - 10https://gerrit.wikimedia.org/r/474694 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [17:24:02] PROBLEM - Check systemd state on cp1073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:24:44] 10Operations, 10Certcentral, 10Traffic, 10Patch-For-Review: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (10Vgutierrez) looking good: `vgutierrez@neodymium:~$ sudo cumin netmon1002.wikimedia.org,netmon2001.wikimedia.org 'sha256sum /etc/centralcerts/l... [17:26:11] (03CR) 10Ema: [C: 032] ATS: disable icinga notifications [puppet] - 10https://gerrit.wikimedia.org/r/474749 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [17:27:18] 10Operations, 10ops-eqiad: Degraded RAID on labcontrol1001 - https://phabricator.wikimedia.org/T209829 (10aborrero) Manual check: ` aborrero@icinga1001:~ $ /usr/lib/nagios/plugins/check_nrpe -4 -H labcontrol1001 -c get_raid_status_md Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]... [17:29:01] 10Operations, 10netops: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (10RobH) removing the project for access requests, since this is now a netops thing. [17:32:40] (03PS1) 10Bstorm: wiki replicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/474751 (https://phabricator.wikimedia.org/T209517) [17:34:52] 10Operations, 10monitoring, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) [17:38:00] PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:38:33] 10Operations, 10netops: asw2-a-eqiad FPC2 reboot - https://phabricator.wikimedia.org/T209588 (10Cmjohnson) @ayounsi, power cables are fine, both power supplies are green. There wasn't anyone in the cage at the time of the reboot. [17:40:14] PROBLEM - Disk space on cp2003 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=91%) [17:42:03] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) [17:42:41] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) [17:43:03] bleh, pointless change i rolled back task edit. [17:49:45] 10Operations, 10ops-codfw, 10DBA, 10decommission: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (10Marostegui) [17:50:14] 10Operations, 10ops-eqiad, 10netops: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (10Cmjohnson) Physically it was impossible to get to the s/n without removing them from the mounts. ayounsi was able to get them a different way. Asset tags ps1-c1 wmf7459 ps1-c2 wm... [17:50:22] (03PS1) 10Ema: ATS: actually disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/474753 (https://phabricator.wikimedia.org/T204225) [17:54:03] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10RobH) a:03fgiunchedi Filippo volunteered to review this during our SRE team meeting, reassigning. 
[17:54:58] (03CR) 10Elukey: ">" (033 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 (owner: 10Elukey) [17:56:00] (03PS1) 10Bstorm: sonofgridengine: Fix up the grid_configurator script to parse exec configs [puppet] - 10https://gerrit.wikimedia.org/r/474755 (https://phabricator.wikimedia.org/T200557) [17:58:42] (03CR) 10Arturo Borrero Gonzalez: "> > Patch Set 16:" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [18:00:04] gehel and onimisionipe: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T1800). [18:00:20] here here! [18:04:27] (03PS17) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [18:04:58] (03PS18) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [18:05:36] (03PS1) 10Andrew Bogott: Horizon: enable deployment-prep in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/474756 (https://phabricator.wikimedia.org/T208101) [18:05:41] (03CR) 10GTirloni: "> I would put the code in modules/profile/manifests/toolforge/infrastructure.pp" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [18:06:49] (03CR) 10Andrew Bogott: [C: 032] Horizon: enable deployment-prep in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/474756 (https://phabricator.wikimedia.org/T208101) (owner: 10Andrew Bogott) [18:08:40] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@a25eb30]: GUI Update, new executor limits and new blazegraph build [18:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:52] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - 
https://phabricator.wikimedia.org/T209425 (10jijiki) rdb2001 was used for a demo, thus it was re-imaged. [18:10:24] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [18:10:37] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) 05Open>03Resolved [18:12:52] (03PS11) 10Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 [18:17:33] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@a25eb30]: GUI Update, new executor limits and new blazegraph build (duration: 08m 53s) [18:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:28] (03PS2) 10Ema: ATS: actually disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/474753 (https://phabricator.wikimedia.org/T204225) [18:25:40] (03CR) 10Ema: [C: 032] ATS: actually disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/474753 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [18:26:08] (03PS3) 10Cwhite: icinga: Re-add reload functionality [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [18:26:46] RECOVERY - Disk space on cp1073 is OK: DISK OK [18:27:00] RECOVERY - Disk space on cp2003 is OK: DISK OK [18:27:07] (03CR) 10Cwhite: [C: 032] icinga: Re-add reload functionality [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [18:29:12] !log connecting eqiad asw2-b fpc2 and fpc8 [18:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:23] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [18:32:25] 10Operations, 10ops-eqiad, 10netops: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 
(10ayounsi) 05Open>03Resolved a:05Cmjohnson>03ayounsi Serial exported from LibreNMS. All 8 PSUs imported in Netbox, as well as their console connections. [18:32:49] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [18:34:18] RECOVERY - Disk space on cp2015 is OK: DISK OK [18:37:19] (03PS1) 10Andrew Bogott: shinken: temporarily remove monitoring for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/474758 (https://phabricator.wikimedia.org/T208101) [18:39:03] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) In theory we could fix this with the upgrade to 2.16 (as nothing uses the db anymore but it's still... [18:39:14] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) [18:40:24] RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational [18:52:04] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:53:29] (03CR) 10Dzahn: [C: 031] "thanks! ack" [puppet] - 10https://gerrit.wikimedia.org/r/474710 (owner: 10Alexandros Kosiaris) [18:58:56] (03CR) 10Dzahn: "re: icinga migration. you can do this anytime. when compiling it just don't worry about einsteinium anymore. 
if it's fine on icinga1001, g" [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [18:59:40] PROBLEM - Disk space on cp2009 is CRITICAL: DISK CRITICAL - free space: / 214 MB (2% inode=91%) [18:59:41] (03CR) 10Dzahn: [C: 04-1] "that being said, it does report a syntax error it looks: https://puppet-compiler.wmflabs.org/compiler1002/13590/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T1900) [19:00:04] stephanebisson and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:26] I'm here [19:00:45] I'll SWAT [19:00:57] And I'm filling in for Stephane who isn't here [19:10:21] (03PS1) 10Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) [19:11:13] (03CR) 10jerkins-bot: [V: 04-1] rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [19:12:11] (03PS2) 10Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) [19:12:22] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:13:07] (03CR) 10jerkins-bot: [V: 04-1] rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [19:18:07] (03PS1) 10Catrope: Enable 
WelcomeSurvey on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474761 (https://phabricator.wikimedia.org/T209725) [19:18:25] (03CR) 10Catrope: [C: 032] Enable WelcomeSurvey on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474761 (https://phabricator.wikimedia.org/T209725) (owner: 10Catrope) [19:19:26] (03Merged) 10jenkins-bot: Enable WelcomeSurvey on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474761 (https://phabricator.wikimedia.org/T209725) (owner: 10Catrope) [19:25:36] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WelcomeSurvey on testwiki (T209725) (duration: 00m 49s) [19:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:40] T209725: Deploy Welcome survey A to production - https://phabricator.wikimedia.org/T209725 [19:26:38] (03PS2) 10Bstorm: sonofgridengine: Fix up the grid_configurator script to parse exec configs [puppet] - 10https://gerrit.wikimedia.org/r/474755 (https://phabricator.wikimedia.org/T200557) [19:28:29] (03CR) 10Bstorm: [C: 032] sonofgridengine: Fix up the grid_configurator script to parse exec configs [puppet] - 10https://gerrit.wikimedia.org/r/474755 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:35:20] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Patch-For-Review, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10jrbs) >>! In T209802#4757844, @MoritzMuehlenhoff wrote: >>>! In T209802#4757707, @tstarling wrote: >> Installing the package gnupg1 and u... 
[19:36:27] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/GrowthExperiments/: WelcomeSurvey fixes (T206371) (duration: 00m 46s) [19:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:30] T206371: Personalized first day: build Variation A - https://phabricator.wikimedia.org/T206371 [19:39:19] 10Operations, 10MediaWiki-extensions-SecurePoll, 10Patch-For-Review, 10Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (10Bawolff) One thing that was confusing me was why timeout in limit.sh wasnt killing the process eventually. But after reading docs i guess... [19:39:45] (03PS2) 10Catrope: Enable and configure Welcome survey on kowiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474331 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [19:40:14] (03CR) 10Catrope: [C: 032] Enable and configure Welcome survey on kowiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474331 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [19:41:37] (03Merged) 10jenkins-bot: Enable and configure Welcome survey on kowiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474331 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [19:44:32] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/: EditorJourney fixes (T207307) (duration: 00m 46s) [19:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:36] T207307: Understanding first day: testing and QA - https://phabricator.wikimedia.org/T207307 [19:44:46] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10thcipriani) >>! In T209088#4745829, @akosiaris wrote: > I think we should support multiple tags per image (docker anyway does support that and they cost next to... 
[19:50:17] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WelcomeSurvey on cswiki and kowiki (T209725) (duration: 00m 46s) [19:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:20] T209725: Deploy Welcome survey A to production - https://phabricator.wikimedia.org/T209725 [19:50:45] SWAT complete [19:52:14] Whoops, spoke too soon, there's an i18n patch to do still [19:54:31] (03PS1) 10Bstorm: maintain-dbusers: Shut up noisy logging [puppet] - 10https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238) [19:55:10] (03CR) 10jerkins-bot: [V: 04-1] maintain-dbusers: Shut up noisy logging [puppet] - 10https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238) (owner: 10Bstorm) [20:06:02] (03PS1) 10Papaul: DNS: Add production and mgmt DNS entries for sessionstore200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/474771 (https://phabricator.wikimedia.org/T209389) [20:06:11] !log catrope@deploy1001 Started scap: Full scap for special alias changes for GrowthExperiments [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:51] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) @robh which RAID 1 partman recipe are we using here? [20:14:16] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) So anytime we have dual disks with sw raid (and no... 
[20:18:12] Operations, Icinga, monitoring, Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (Volans) >>! In T209757#4756307, @jcrespo wrote: > @volans I reported this very issue, believing firmly this was a bug on our icinga installatio...
[20:27:14] !log catrope@deploy1001 Finished scap: Full scap for special alias changes for GrowthExperiments (duration: 21m 03s)
[20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:58] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[20:34:34] (PS3) Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852)
[20:36:26] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 39.08 ms
[20:41:06] (PS4) Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852)
[20:43:17] (PS1) Bstorm: sonofgridengine: observer env and the openstack client libs for SGE master [puppet] - https://gerrit.wikimedia.org/r/474776 (https://phabricator.wikimedia.org/T200557)
[20:44:07] (PS2) Bstorm: sonofgridengine: observer env and the openstack client libs for SGE master [puppet] - https://gerrit.wikimedia.org/r/474776 (https://phabricator.wikimedia.org/T200557)
[20:50:39] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (ayounsi)
[20:50:59] (PS5) Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852)
[20:55:22] (PS2) Bstorm: maintain-dbusers: Shut up noisy logging [puppet] - https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238)
[20:55:51] (CR) Bstorm: [C: 2] sonofgridengine: observer env and the openstack client libs for SGE master [puppet] - https://gerrit.wikimedia.org/r/474776 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[20:55:58] (CR) jerkins-bot: [V: -1] maintain-dbusers: Shut up noisy logging [puppet] - https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238) (owner: Bstorm)
[20:56:12] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (bd808)
[20:56:51] (PS6) Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852)
[20:57:47] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight)
[20:59:17] Operations, DBA, Gerrit, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) Gerrit's db support is being removed in https://gerrit-review.googlesource.com/c/gerrit/+/205196 :)
[21:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T2100).
[21:01:03] Operations, User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (jijiki)
[21:10:30] (PS3) Bstorm: maintain-dbusers: Shut up noisy logging [puppet] - https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238)
[21:12:03] (CR) Bstorm: [C: 2] maintain-dbusers: Shut up noisy logging [puppet] - https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238) (owner: Bstorm)
[21:12:11] (PS4) Bstorm: maintain-dbusers: Shut up noisy logging [puppet] - https://gerrit.wikimedia.org/r/474766 (https://phabricator.wikimedia.org/T206238)
[21:16:15] Operations, DBA, JADE, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (Marostegui) >>! In T196547#4748654, @awight wrote: > This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision t...
[21:18:14] Operations, Scap, User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (thcipriani)
[21:19:27] Operations, DBA, JADE, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight)
[21:20:05] Operations, Performance-Team (Radar): Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (Imarlier)
[21:20:21] Operations, DBA, JADE, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight) >>! In T196547#4759786, @Marostegui wrote: > There are some other big wikis (commons) where this is also a concern and some other agreements were made in orde...
[21:22:55] Operations, Icinga, monitoring, Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (colewhite) @volans I do not think we have a good explanation for that behavior. It's deprecated software and we are transferring state files b...
[21:32:24] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.007 second response time
[21:33:34] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time
[21:36:58] (PS1) Bstorm: openstack client: Install python3 stuff on stretch [puppet] - https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557)
[21:41:03] (PS2) Ayounsi: Icinga, assign bfd check to routers [puppet] - https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992)
[21:47:48] (CR) Volans: "As discussed on IRC we still need to figure out how to solve the failing test, but we have a better understanding of it now." (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[21:53:22] (PS3) Niedzielski: BC Wikibase: override repoConceptBaseUri [mediawiki-config] - https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352)
[21:53:41] (PS4) Niedzielski: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755)
[21:54:58] (PS1) Bstorm: sonofgridengine: fix the path of the exechosts config dir [puppet] - https://gerrit.wikimedia.org/r/474801 (https://phabricator.wikimedia.org/T200557)
[21:55:19] (CR) Ayounsi: "It needed a manual rebase to fix conflicts." [puppet] - https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi)
[21:56:13] (CR) Bstorm: [C: 2] sonofgridengine: fix the path of the exechosts config dir [puppet] - https://gerrit.wikimedia.org/r/474801 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[21:58:50] !log restart bird on dns2001 to try to establish the BFD sessions
[21:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:04] bawolff and Reedy: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181119T2200).
[22:00:22] PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:25:47] (PS1) EBernhardson: Set default elasticsearch cluster name in profile [puppet] - https://gerrit.wikimedia.org/r/474807
[22:31:13] RECOVERY - puppet last run on cloudcontrol1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:31:41] Operations, MediaWiki-extensions-SecurePoll, Patch-For-Review, Wikimedia-production-error: Cannot vote on votewiki - https://phabricator.wikimedia.org/T209802 (tstarling) Open>Resolved a:tstarling >>! In T209802#4759395, @Bawolff wrote: > One thing that was confusing me was why timeou...
[22:34:07] (PS2) EBernhardson: Set default elasticsearch cluster name in profile [puppet] - https://gerrit.wikimedia.org/r/474807
[22:42:10] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[22:44:46] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[22:47:28] (PS1) Herron: logstash: add type "apache-error" and use logstash core patterns [puppet] - https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852)
[22:51:54] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team, User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) Updater seems to be able to get about 4-5 updates...
[22:55:23] Operations, Wikidata, Wikidata-Query-Service, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Trying to run Updater on labs (for T206636) where there's no Kafka, I...
[23:11:28] (PS1) Bstorm: sonofgridengine: fix missing instantiation [puppet] - https://gerrit.wikimedia.org/r/474816 (https://phabricator.wikimedia.org/T200557)
[23:24:40] (CR) Bstorm: [C: 2] sonofgridengine: fix missing instantiation [puppet] - https://gerrit.wikimedia.org/r/474816 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[23:40:47] (CR) Andrew Bogott: Set default elasticsearch cluster name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[23:45:56] (PS1) Ayounsi: Bird, permit BFD multihop port in ferm [puppet] - https://gerrit.wikimedia.org/r/474819
[23:52:02] (PS2) Ayounsi: Bird anycast DNS, add BFD multicast support [puppet] - https://gerrit.wikimedia.org/r/474819
[23:53:20] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/13593/" [puppet] - https://gerrit.wikimedia.org/r/474819 (owner: Ayounsi)
[23:56:36] (PS1) Alex Monk: deployment-prep: Update some IPs for the migration [puppet] - https://gerrit.wikimedia.org/r/474820 (https://phabricator.wikimedia.org/T208101)
[23:56:48] (CR) EBernhardson: Set default elasticsearch cluster name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[23:58:49] (CR) Andrew Bogott: Set default elasticsearch cluster name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[23:59:04] (CR) Andrew Bogott: [C: 2] deployment-prep: Update some IPs for the migration [puppet] - https://gerrit.wikimedia.org/r/474820 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)