[00:06:15] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[00:07:16] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[00:07:25] you again
[00:09:05] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[00:16:05] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[00:18:05] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:01:05] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[01:02:05] RECOVERY - nova-compute process on labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[02:24:59] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 07m 41s)
[02:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jul 31 02:31:39 UTC 2017 (duration 6m 40s)
[02:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.13 seconds
[04:14:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 289.55 seconds
[04:19:53] (PS1) Andrew Bogott: get $::labsproject from the certname [puppet] - https://gerrit.wikimedia.org/r/368606 (https://phabricator.wikimedia.org/T171289)
[04:42:05] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[04:44:05] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[04:45:35] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old.
[05:13:51] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T171723#3485006 (Marostegui) Open>Resolved a:Cmjohnson This is now fixed: ``` root@db1068:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name...
[05:16:05] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3485012 (Marostegui) Removing the DBA tag as there is nothing for us to do here - I will remain subscribed to the task just in case...
[06:43:50] (PS3) Giuseppe Lavagetto: apache::conf: convert to use validate_numeric [puppet] - https://gerrit.wikimedia.org/r/367891 (https://phabricator.wikimedia.org/T171704)
[06:51:44] Operations, Patch-For-Review, Prod-Kubernetes (Experiment), User-Joe: Set up docker building environment for production - https://phabricator.wikimedia.org/T149812#3485033 (Joe) Open>Resolved
[06:52:46] Operations, Puppet, Patch-For-Review, User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3485034 (Joe)
[06:53:09] (CR) Giuseppe Lavagetto: [C: 2] apache::conf: convert to use validate_numeric [puppet] - https://gerrit.wikimedia.org/r/367891 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[06:55:48] (PS5) Giuseppe Lavagetto: role::mediawiki::canary_appserver: move to the future parser [puppet] - https://gerrit.wikimedia.org/r/367876 (https://phabricator.wikimedia.org/T171704)
[06:57:05] (PS1) Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204)
[06:58:26] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[07:01:22] (PS2) Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204)
[07:01:36] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::canary_appserver: move to the future parser [puppet] - https://gerrit.wikimedia.org/r/367876 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:01:40] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7215/" [puppet] - https://gerrit.wikimedia.org/r/367876 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[07:04:33] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[07:05:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[07:06:14] (CR) jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/368607 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[07:07:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T166204 (duration: 00m 52s)
[07:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:35] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[07:09:14] Operations, Operations-Software-Development, Pybal, Traffic, Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#3485048 (Volans) The added `PyBal IPVS diff check` is flapping a bit with UNKNOWN for some ho...
[07:10:16] (PS1) Giuseppe Lavagetto: admin: fixup for I191cbe091347e [puppet] - https://gerrit.wikimedia.org/r/368610
[07:12:51] !log Deploy alter table on db1055 - T166204
[07:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:02] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[07:17:36] !log Stop replication on s7 on db1102 for maintenance - T153743
[07:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:45] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[07:21:55] (PS2) Elukey: admin::data::data.yaml: remove ironholds from absented users [puppet] - https://gerrit.wikimedia.org/r/368577 (https://phabricator.wikimedia.org/T171696)
[07:22:16] (CR) Jonas Kress (WMDE): [C: 1] Log 'WikibaseQualityConstraints' [mediawiki-config] - https://gerrit.wikimedia.org/r/367914 (https://phabricator.wikimedia.org/T171281) (owner: Lucas Werkmeister (WMDE))
[07:24:17] (CR) Elukey: [C: 2] admin::data::data.yaml: remove ironholds from absented users [puppet] - https://gerrit.wikimedia.org/r/368577 (https://phabricator.wikimedia.org/T171696) (owner: Elukey)
[07:31:45] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3485057 (Volans)
[07:31:47] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T172051#3485055 (Volans)
[07:31:55] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3480411 (Volans)
[07:31:58] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T172054#3485059 (Volans)
[07:32:08] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T172062#3485063 (Volans)
[07:32:10] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3480411 (Volans)
[07:32:24] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3485066 (Dereckson) This solution to disable Flow isn't acceptable. We disabled Flow on two existing wikis, because they already used the...
[08:02:37] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3485113 (fgiunchedi)
[08:03:31] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3376162 (Trizek-WMF) Just to be sure, [[ https://www.mediawiki.org/wiki/Flow/Sandbox | have you tried Flow ]]?
[08:09:25] Operations, Commons, media-storage, monitoring, User-fgiunchedi: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#3485116 (fgiunchedi)
[08:35:35] !log Drop table old_growth on s1 - T115982
[08:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:49] T115982: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982
[08:39:52] Operations, DBA, MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3485197 (Marostegui) The empty table old_growth has been dropped from enwiki.
[08:39:57] Operations, DBA, MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3485198 (Marostegui)
[08:43:34] (PS1) Giuseppe Lavagetto: base::puppet: fix the environment declaration with the future parser [puppet] - https://gerrit.wikimedia.org/r/368611
[08:43:45] <_joe_> volans: ^^
[08:43:54] thx, looking
[08:44:14] (PS1) Elukey: Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712)
[08:44:34] * volans hates unless
[08:44:40] :-P
[08:45:13] !log Rename table click_tracking and click_tracking_user_properties on db1089 (s1) - T115982
[08:45:17] (CR) jerkins-bot: [V: -1] Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712) (owner: Elukey)
[08:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:25] T115982: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982
[08:47:20] (PS2) Elukey: Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712)
[08:47:49] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/368611 (owner: Giuseppe Lavagetto)
[08:48:19] (CR) jerkins-bot: [V: -1] Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712) (owner: Elukey)
[08:48:44] <_joe_> volans: uhm I did screw up something
[08:49:22] (PS2) Giuseppe Lavagetto: base::puppet: fix the environment declaration with the future parser [puppet] - https://gerrit.wikimedia.org/r/368611
[08:50:35] Operations, DBA, MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3485206 (Marostegui) On `db1089` (enwiki) I have renamed `click_tracking` and `click...
[08:50:57] (CR) Volans: [C: 1] base::puppet: fix the environment declaration with the future parser [puppet] - https://gerrit.wikimedia.org/r/368611 (owner: Giuseppe Lavagetto)
[08:52:23] Operations, monitoring, User-fgiunchedi: Update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#1248584 (fgiunchedi) Both issues (debug log and slow stop) have been bandaided in our puppet in the meantime
[08:54:23] (CR) Giuseppe Lavagetto: [C: 2] base::puppet: fix the environment declaration with the future parser [puppet] - https://gerrit.wikimedia.org/r/368611 (owner: Giuseppe Lavagetto)
[08:54:36] (PS3) Elukey: Remove stat1002 configuration as part of decom [puppet] - https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712)
[08:55:09] !log update nodejs* on aqs100[56789] to 6.11 - T170790
[08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:21] T170790: Upgrade AQS to node 6.11 - https://phabricator.wikimedia.org/T170790
[08:59:24] (PS1) Hashar: contint: upgrade tox [puppet] - https://gerrit.wikimedia.org/r/368616 (https://phabricator.wikimedia.org/T169602)
[09:04:15] (CR) Paladox: contint: upgrade tox (1 comment) [puppet] - https://gerrit.wikimedia.org/r/368616 (https://phabricator.wikimedia.org/T169602) (owner: Hashar)
[09:06:47] (CR) Paladox: [C: 1] "Bug has been fixed upstream. Bug is not in 2.14 or 2.13 :)" [puppet] - https://gerrit.wikimedia.org/r/368547 (owner: Paladox)
[09:15:07] (PS2) Ema: pybal::monitoring: bump check_pybal_ipvs_diff timeout [puppet] - https://gerrit.wikimedia.org/r/368416 (https://phabricator.wikimedia.org/T134893)
[09:15:55] (PS1) Giuseppe Lavagetto: role::mediawiki::appserver::canary_api: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368618 (https://phabricator.wikimedia.org/T171704)
[09:15:56] (PS1) Giuseppe Lavagetto: role::mediawiki::appserver::*: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368619 (https://phabricator.wikimedia.org/T171704)
[09:15:58] (PS1) Giuseppe Lavagetto: role::mediawiki::imagescaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368620 (https://phabricator.wikimedia.org/T171704)
[09:16:01] (PS1) Giuseppe Lavagetto: role::mediawiki::jobrunner/videoscaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368621 (https://phabricator.wikimedia.org/T171704)
[09:16:59] (CR) Ema: [C: 2] pybal::monitoring: bump check_pybal_ipvs_diff timeout [puppet] - https://gerrit.wikimedia.org/r/368416 (https://phabricator.wikimedia.org/T134893) (owner: Ema)
[09:21:06] (PS1) Filippo Giunchedi: prometheus: enable vmstat node-exporter collector [puppet] - https://gerrit.wikimedia.org/r/368622
[09:21:39] (PS2) Giuseppe Lavagetto: role::mediawiki::appserver::canary_api: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368618 (https://phabricator.wikimedia.org/T171704)
[09:21:51] (PS2) Filippo Giunchedi: prometheus: enable vmstat node-exporter collector [puppet] - https://gerrit.wikimedia.org/r/368622
[09:21:53] (PS1) Volans: Failoid: migrate to Puppet's future parser [puppet] - https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704)
[09:22:26] (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::mediawiki::appserver::canary_api: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368618 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:23:43] (PS2) Giuseppe Lavagetto: role::mediawiki::appserver::*: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368619 (https://phabricator.wikimedia.org/T171704)
[09:26:13] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3376166 (Dereckson)
[09:26:35] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::appserver::*: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368619 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:26:57] (PS1) Jcrespo: parsercache: Retire temporary parsercaches from monitoring [puppet] - https://gerrit.wikimedia.org/r/368624 (https://phabricator.wikimedia.org/T167784)
[09:28:49] !log Stop replication on labsdb1009 and labsdb1010 for maintenance - T153743
[09:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:58] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[09:29:56] (PS2) Jcrespo: parsercache: Retire temporary parsercaches from monitoring [puppet] - https://gerrit.wikimedia.org/r/368624 (https://phabricator.wikimedia.org/T167784)
[09:31:30] (CR) Jcrespo: [C: 2] parsercache: Retire temporary parsercaches from monitoring [puppet] - https://gerrit.wikimedia.org/r/368624 (https://phabricator.wikimedia.org/T167784) (owner: Jcrespo)
[09:33:26] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3485335 (Dereckson) @Jayprakash12345 provided me by mail a translation for the Flow Topic: namespace, so I opened T172093 and have submitt...
[09:33:50] (PS2) Giuseppe Lavagetto: role::mediawiki::imagescaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368620 (https://phabricator.wikimedia.org/T171704)
[09:34:41] (PS3) Filippo Giunchedi: prometheus: enable vmstat node-exporter collector [puppet] - https://gerrit.wikimedia.org/r/368622
[09:35:54] (CR) Filippo Giunchedi: [C: 2] prometheus: enable vmstat node-exporter collector [puppet] - https://gerrit.wikimedia.org/r/368622 (owner: Filippo Giunchedi)
[09:36:13] (CR) Giuseppe Lavagetto: [C: 2] role::mediawiki::imagescaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368620 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[09:36:28] (PS3) Giuseppe Lavagetto: role::mediawiki::imagescaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368620 (https://phabricator.wikimedia.org/T171704)
[09:41:31] (CR) Volans: "Compiler output at: https://puppet-compiler.wmflabs.org/compiler02/7220/" [puppet] - https://gerrit.wikimedia.org/r/368623 (https://phabricator.wikimedia.org/T171704) (owner: Volans)
[09:43:31] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3485417 (Jayprakash12345) >>! In T168765#3485335, @Dereckson wrote: > @Jayprakash12345 provided me by mail a translation for the Flow Topi...
[09:46:24] (PS1) Jcrespo: Include db11XX and db21XX servers into the mysql cluster [puppet] - https://gerrit.wikimedia.org/r/368628 (https://phabricator.wikimedia.org/T170662)
[09:46:53] (CR) Jcrespo: "This was noticed because db1102 was not included on the cluster." [puppet] - https://gerrit.wikimedia.org/r/368628 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[09:48:26] (CR) Marostegui: [C: 1] Include db11XX and db21XX servers into the mysql cluster [puppet] - https://gerrit.wikimedia.org/r/368628 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[09:49:05] (CR) Jcrespo: [C: 2] Include db11XX and db21XX servers into the mysql cluster [puppet] - https://gerrit.wikimedia.org/r/368628 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[09:52:42] (PS2) Giuseppe Lavagetto: role::mediawiki::jobrunner/videoscaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368621 (https://phabricator.wikimedia.org/T171704)
[10:05:18] (PS1) Giuseppe Lavagetto: puppet: make all priorities explicitly integers [puppet] - https://gerrit.wikimedia.org/r/368638
[10:17:16] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3485531 (Jayprakash12345)
[10:24:31] (CR) Giuseppe Lavagetto: [C: 2] puppet: make all priorities explicitly integers [puppet] - https://gerrit.wikimedia.org/r/368638 (owner: Giuseppe Lavagetto)
[10:48:46] Reedy: https://meta.wikimedia.org/wiki/MediaWiki:Email-blacklist <-- is global? (his list affects only this wiki;)?
[10:56:41] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3485634 (StevenJ81) I guess the subquestion of disabling Flow is done. But for the record I've tried Flow, I hate Flow, I find it to be di...
[11:06:12] Reedy: gerrit claims it is global, but the WM page you created don't .
[11:12:18] !log Compress s6 on db1102 - T153743
[11:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:30] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[11:18:16] Operations, Traffic: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf - https://phabricator.wikimedia.org/T172101#3485695 (ema)
[11:18:24] Operations, Traffic: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf - https://phabricator.wikimedia.org/T172101#3485707 (ema) p:Triage>Normal
[11:31:47] (PS1) Addshore: Remove WMDE log Channel [mediawiki-config] - https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635)
[11:37:41] (PS5) Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[11:39:31] (CR) jerkins-bot: [V: -1] Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[11:41:43] (PS6) Addshore: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183)
[11:43:37] (PS2) Addshore: DNM remove wgRevisionSliderAlternateSlider [mediawiki-config] - https://gerrit.wikimedia.org/r/363206
[11:43:41] (PS2) Addshore: DNM Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/363207
[11:44:03] (PS3) Addshore: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/363207
[11:44:14] (PS3) Addshore: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - https://gerrit.wikimedia.org/r/363206
[11:45:21] (CR) Addshore: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - https://gerrit.wikimedia.org/r/363206 (owner: Addshore)
[11:45:25] (CR) Addshore: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/363207 (owner: Addshore)
[11:56:27] Operations, Traffic: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103#3485845 (ema)
[11:56:42] Operations, Pybal, Traffic: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103#3485857 (ema) p:Triage>Normal
[12:09:01] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2031460
[12:33:13] (PS1) ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849)
[12:36:25] Operations, Dumps-Generation, Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3485967 (ArielGlenn) https://gerrit.wikimedia.org/r/#/c/368744/ is a draft of the callback plus file content check for dump output files as they are produce...
[12:39:01] !log banning elastic10(17|18|19|20) to prepare for thermal paste - T168816
[12:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:13] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816
[12:52:41] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:53:58] ^ elasticsearch above might be me, having a look
[12:54:41] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[12:56:08] !log un-banning elastic1020 since it seems to have impact on cluster performances - T168816
[12:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:18] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816
[12:56:31] damn, last time there was mostly no perf impact with losing 4 nodes in the cluster...
[12:57:19] jouncebot: next
[12:57:20] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T1300)
[12:57:53] addshore: deploying your own changes today during eu swat? or should I?
[12:58:28] zeljkof: how about you do the merging and deploying stuff and I simply verify them? :)
[12:58:42] addshore: sure :)
[12:58:45] I have a bit of a headache :/
[12:58:47] Thanks! :)
[12:59:33] addshore: I hope it will get better soon, headaches are no fun
[12:59:49] Nope, think I am just massively dehydrated
[13:00:00] hashar: I can swat today, but do you have a few minutes to take a look at the patches, just in case?
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T1300). Please do the needful.
[13:00:05] gehel, debt, and addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:15] jouncebot: o/
[13:00:19] o/
[13:00:21] \o
[13:00:23] jouncebot: o/
[13:00:27] I can SWAT today!
[13:00:30] \O/
[13:00:36] :D
[13:01:04] gehel: about your patch https://gerrit.wikimedia.org/r/#/c/368172/4 should one setting always imply the other is set ?
[13:01:16] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] Gehel banning nodes for thermal paste has more impact than expected - T168816
[13:01:31] (CR) Hashar: [C: 1] enable mapframe for euwiki, ptwiki and uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/368172 (https://phabricator.wikimedia.org/T167619) (owner: Gehel)
[13:01:39] hashar: yes
[13:01:55] (CR) Hashar: [C: 1] Remove WMDE log Channel [mediawiki-config] - https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: Addshore)
[13:02:02] hashar: are you doing the swat, or just reviewing? just checking, so we don't both deploy :)
[13:02:05] hashar: we should actually merge those settings, it does not really make sense to expose both
[13:02:33] o/
[13:02:48] (CR) Hashar: [C: 1] Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/363207 (owner: Addshore)
[13:02:53] Hey _joe_
[13:03:13] (CR) Hashar: [C: 1] Remove wgRevisionSliderAlternateSlider [mediawiki-config] - https://gerrit.wikimedia.org/r/363206 (owner: Addshore)
[13:03:23] I'm here to continue our stress testing of ORES. :) No rush though if you're working on something else now.
[13:03:32] (CR) Hashar: [C: -1] Remove Wikibase vs Interwikisorting checks [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[13:04:00] :O
[13:04:00] (CR) Hashar: [C: 1] "wrong click :D" [mediawiki-config] - https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[13:04:02] :D
[13:04:04] zeljkof: all reviewed :D
[13:04:17] hashar: should I do the swat?
[13:05:45] oh, CI is busy, this will be fun :(
[13:06:40] zeljkof: patches receiving a +2 takes precedence
[13:06:44] take
[13:07:38] gehel, debt: merging 368172
[13:07:43] (CR) jerkins-bot: [V: -1] write dump output files to temporary location, move in place when done [dumps] - https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[13:07:51] (PS5) Zfilipin: enable mapframe for euwiki, ptwiki and uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/368172 (https://phabricator.wikimedia.org/T167619) (owner: Gehel)
[13:07:51] * gehel is crossing fingers
[13:08:07] * debt let's get this done! :)
[13:08:20] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/368172 (https://phabricator.wikimedia.org/T167619) (owner: Gehel)
[13:09:06] zeljkof: let us know when we can test...
[13:10:00] gehel: I will pull the commit to mwdebug1002 as soon as it is merged, can you test there?
[13:10:12] zeljkof: yep cc:debt
[13:10:40] got it, thanks
[13:10:52] (Merged) jenkins-bot: enable mapframe for euwiki, ptwiki and uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/368172 (https://phabricator.wikimedia.org/T167619) (owner: Gehel)
[13:11:03] (CR) jenkins-bot: enable mapframe for euwiki, ptwiki and uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/368172 (https://phabricator.wikimedia.org/T167619) (owner: Gehel)
[13:12:15] gehel, debt: 368172 is at mwdebug1002, please test and let me know if I can proceed
[13:12:39] debt: I'll let you do the testing...
[13:13:02] gehel: so far, not seeing it
[13:13:06] (PS2) Zfilipin: Remove WMDE log Channel [mediawiki-config] - https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: Addshore)
[13:13:08] <_joe_> halfak: around?
[13:13:14] (PS4) Zfilipin: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/363207 (owner: Addshore)
[13:13:18] (PS4) Zfilipin: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - https://gerrit.wikimedia.org/r/363206 (owner: Addshore)
[13:14:19] I have added two patches for SWAT
[13:14:22] gehel, debt: you do not see 368172 at mwdebug1002?
[13:14:24] if that's okay
[13:14:30] (PS1) Ottomata: Install libcgi-pm-perl for wikistats 1.0 ezachte [puppet] - https://gerrit.wikimedia.org/r/368763 (https://phabricator.wikimedia.org/T152712)
[13:14:36] Amir1: sure, if there is time
[13:14:41] Thanks
[13:14:57] zeljkof, gehel - I'm not seeing it yet
[13:15:02] zeljkof: nope, checking to see if I can understand why, else we'll rollback. Can you give us 3 more minutes?
[13:15:19] gehel, debt: sure, take your time
[13:16:04] (PS1) Elukey: hive: fix server and metastore configuration [puppet/cdh] - https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107)
[13:16:20] (CR) Ottomata: [C: 2] Install libcgi-pm-perl for wikistats 1.0 ezachte [puppet] - https://gerrit.wikimedia.org/r/368763 (https://phabricator.wikimedia.org/T152712) (owner: Ottomata)
[13:19:40] (PS2) Elukey: hive: fix server and metastore configuration [puppet/cdh] - https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107)
[13:19:54] zeljkof: no idea what is going on. It does not seem to break anything, but it looks like kartographer is still not activated
[13:20:06] * gehel does not understand much about the mediawiki side of maps...
[13:20:24] gehel: sorry, I can not help there myself :|
[13:21:24] (CR) Ottomata: [C: 1] "nice! One nit." (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107) (owner: Elukey)
[13:22:29] gehel, debt: what should I do? revert 368172?
[13:22:32] ottomata: thanks! pcc looks good https://puppet-compiler.wmflabs.org/compiler02/7230/analytics1003.eqiad.wmnet/, will amend and merge ok?
[13:22:34] zeljkof: let's rollback, we'll dig into this a bit more. No reason to hold the SWAT just for us.
[13:22:52] thanks, zeljkof
[13:22:53] elukey: +1
[13:22:54] maybe kartographer is not enabled at all ?
[13:23:05] there might be a setting like $wmgUseKartographer [13:23:15] maplink works, so kartographer seems to be enabled [13:23:27] maplink was already enabled, months ago [13:23:51] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [13:24:19] (03PS3) 10Elukey: hive: fix server and metastore configuration [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107) [13:24:24] We're not going to debug that one live just now. But we'll get to the bottom of this. [13:24:31] zeljkof: all my apologies [13:25:27] gehel: no problem, reverting then [13:25:53] zeljkof: thanks! [13:26:20] (03PS1) 10Zfilipin: Revert "enable mapframe for euwiki, ptwiki and uawikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368766 [13:26:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368766 (owner: 10Zfilipin) [13:26:44] thanks again, zeljkof; we'll get this figured out [13:27:07] debt: there is always another swat window :D [13:27:26] zeljkof, gehel yes! 
:) [13:27:47] (03PS4) 10Elukey: hive: fix server and metastore configuration [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107) [13:28:07] (03Merged) 10jenkins-bot: Revert "enable mapframe for euwiki, ptwiki and uawikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368766 (owner: 10Zfilipin) [13:28:26] (03CR) 10jenkins-bot: Revert "enable mapframe for euwiki, ptwiki and uawikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368766 (owner: 10Zfilipin) [13:29:52] addshore: reviewing your commits, apologies for the delay, some trouble with the first commit [13:29:57] no problem [13:30:24] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:30:28] (03PS6) 10Rush: openstack: move rabbitmq to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494) [13:30:48] !log disable puppet for cloud-y things [13:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:23] (03CR) 10Elukey: [V: 032 C: 032] hive: fix server and metastore configuration [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368764 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [13:33:45] addshore: CI is busy, it might take a while to merge :( [13:33:54] ack [13:37:02] (03PS1) 10Elukey: modules::cdh: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/368767 (https://phabricator.wikimedia.org/T172107) [13:37:32] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Fix fqdn for promethium - https://phabricator.wikimedia.org/T172111#3486106 (10Andrew) [13:38:17] (03PS3) 10Zfilipin: Remove WMDE log Channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:38:49] 10Operations, 10Operations-Software-Development, 10Pybal, 10Traffic, 
10Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#3486123 (10BBlack) >>! In T134893#3485048, @Volans wrote: > The added `PyBal IPVS diff check` i... [13:39:12] (03CR) 10Rush: [C: 032] openstack: move rabbitmq to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:40:05] (03PS3) 10Andrew Bogott: labs puppetmaster: validate cert name before autosigning [puppet] - 10https://gerrit.wikimedia.org/r/368449 (https://phabricator.wikimedia.org/T171961) [13:40:07] (03PS2) 10Andrew Bogott: get $::labsproject from the certname [puppet] - 10https://gerrit.wikimedia.org/r/368606 (https://phabricator.wikimedia.org/T171289) [13:40:22] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7233/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/368767 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [13:40:31] (03PS2) 10Elukey: modules::cdh: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/368767 (https://phabricator.wikimedia.org/T172107) [13:40:34] (03CR) 10Elukey: [V: 032 C: 032] modules::cdh: update to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/368767 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [13:41:11] (03CR) 10Zfilipin: Remove WMDE log Channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:41:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:41:56] (03PS1) 10Mforns: Add QuickSurvey schemas to EventLogging white-list [puppet] - 10https://gerrit.wikimedia.org/r/368769 (https://phabricator.wikimedia.org/T172112) [13:42:12] subbu: I want to make some changes on promethium… let me know when you're around? 
[13:42:15] * halfak looks around for _joe_ [13:42:23] andrewbogott, here [13:42:55] subbu: I'm noticing that (unlike everything else in Labs) promethium doesn't have a project name in its fqdn or puppet cert. [13:43:02] andrewbogott, check /etc/hosts [13:43:03] That might be for a good reason but I'd like to try to fix it [13:43:12] ok. [13:43:15] (03PS1) 10Steinsplitter: Same namespace for global mail blacklist as for global spam blacklist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 [13:43:33] subbu: can you live with some breakage if it goes badly? Or should I wait and schedule this for another time? [13:43:43] (03Merged) 10jenkins-bot: Remove WMDE log Channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:44:18] andrewbogott, well, depends on how long of a breakage .. if it comes backup within the week, i am good. :) [13:44:23] *back up [13:44:29] subbu: ok — I was thinking more like an hour or two [13:44:30] thanks [13:44:36] addshore: finally merged :( pushing to mwdebug1002... [13:44:42] ack! [13:44:48] andrewbogott, you are good then. go for it. [13:44:49] subbu: project should be 'wikitextexp' right? [13:44:56] yes. [13:45:25] a [13:45:40] addshore: 368701 is at mwdebug1002, please test [13:45:57] looks good! [13:46:05] addshore: deploying then [13:46:44] (03CR) 10jenkins-bot: Remove WMDE log Channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368701 (https://phabricator.wikimedia.org/T168635) (owner: 10Addshore) [13:46:58] (03CR) 10Steinsplitter: "Please see Change-Id: Ib66ca24c9d31017dcefd5b52b648e66580e69528" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367537 (owner: 10Reedy) [13:47:22] subbu: seems to have just worked :) Thanks, lmk if you see any bad effects later [13:47:27] (03PS2) 10Steinsplitter: Same namespace for global mail blacklist as for global spam blacklist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 [13:47:39] will do. :) [13:47:39] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368701|Remove WMDE log Channel (T168635)]] (duration: 00m 43s) [13:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:51] T168635: Undeploy campaign-specific patch & logging for tracking user registration and guided tour - https://phabricator.wikimedia.org/T168635 [13:47:54] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3486182 (10Andrew) [13:47:54] addshore: 368701 deployed, please test [13:47:58] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Fix fqdn for promethium - https://phabricator.wikimedia.org/T172111#3486180 (10Andrew) 05Open>03Resolved that was easy [13:48:02] Looks good! [13:49:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [13:49:18] addshore: merging 341127 [13:49:22] ack [13:49:56] 10Operations, 10Pybal, 10Traffic: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103#3486193 (10BBlack) +1. There are a number of tricky things here to get to these simple goals, though, and since the sysctls affect all services, we have to have the TCP cases... [13:50:37] (03CR) 10Stryn: [C: 031] Same namespace for global mail blacklist as for global spam blacklist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 (owner: 10Steinsplitter) [13:51:05] 10Operations, 10Pybal, 10Traffic: Backport ipvsadm - https://phabricator.wikimedia.org/T171850#3486199 (10BBlack) [13:51:08] 10Operations, 10Pybal, 10Traffic: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103#3486198 (10BBlack) [13:51:33] (03Merged) 10jenkins-bot: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [13:51:47] (03CR) 10jenkins-bot: Remove Wikibase vs Interwikisorting checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341127 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [13:55:23] addshore: 341127 is at mwdebug1002 [13:55:27] checking [13:55:58] (03PS5) 10Zfilipin: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 (owner: 10Addshore) [13:56:03] (03PS5) 10Zfilipin: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 (owner: 10Addshore) [13:56:28] zeljkof: looks good! [13:56:33] addshore: deploying [13:56:41] 10Operations, 10Traffic: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf - https://phabricator.wikimedia.org/T172101#3486214 (10BBlack) 05Open>03Resolved a:03BBlack Ran it again and it's ok now. [13:57:25] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341127|Remove Wikibase vs Interwikisorting checks (T150183)]] (duration: 00m 43s) [13:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:35] addshore: deployed, please check [13:57:35] T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183 [13:58:11] addshore, Amir1: looks like there is no other deployment after eu swat, should I continue? 
[13:58:27] looks like CI is no longer as busy too, so it should go faster [13:58:28] It would be great [13:58:33] zeljkof: that one looks fine [13:58:46] and yeh! lets continue! :D [13:59:04] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 (owner: 10Addshore) [13:59:25] addshore, Amir1: will continue with swat then :) [13:59:31] addshore: merging 363206 [14:00:25] (03Merged) 10jenkins-bot: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 (owner: 10Addshore) [14:00:40] (03CR) 10jenkins-bot: Remove wgRevisionSliderAlternateSlider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363206 (owner: 10Addshore) [14:01:22] addshore: 363206 is at mwdebug [14:02:08] checking [14:02:37] zeljkof: looks good [14:03:10] addshore: deploying [14:03:38] (03PS3) 10Steinsplitter: Same namespace for global mail blacklist as for global spam blacklist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 [14:04:18] (03PS1) 10Rush: openstack: replace cloudrepo placeholders for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/368774 (https://phabricator.wikimedia.org/T171494) [14:04:22] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:363206|Remove wgRevisionSliderAlternateSlider]] (duration: 00m 42s) [14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:42] addshore: deployed, please check [14:04:49] checking [14:04:54] (03PS1) 10Aklapper: phabricator: Block certain mobile IP ranges from uploading files [puppet] - 10https://gerrit.wikimedia.org/r/368775 [14:04:56] (03PS4) 10Steinsplitter: Same namespace for global mail blacklist as for global spam blacklist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 [14:04:58] looks good [14:05:01] (03CR) 10Stryn: [C: 031] Same namespace for global mail blacklist as for global spam blacklist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368770 (owner: 10Steinsplitter) [14:05:32] addshore: there is merge conflict for 363207 [14:06:38] (03CR) 10Rush: [C: 032] openstack: replace cloudrepo placeholders for rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/368774 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:06:40] bah [14:07:14] (03PS2) 10Aklapper: phabricator: Block certain mobile IP ranges from uploading files [puppet] - 10https://gerrit.wikimedia.org/r/368775 [14:07:31] Amir1: reviewing 367393 [14:07:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367393 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [14:07:55] 10Operations: ganeti2003 ipmi_sdr_cache_create: internal IPMI error - https://phabricator.wikimedia.org/T172115#3486250 (10herron) [14:08:04] !log shutting down elastic10(17|18|19) for thermal paste - T168816 [14:08:05] (03PS6) 10Addshore: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 [14:08:09] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(17|18|19).eqiad.wmnet [14:08:13] zeljkof: ^^ rebased locally, didnt actually have a conflict to solve [14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:16] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [14:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:24] (03CR) 10Paladox: "This will affect legitiment users" [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 10Aklapper) [14:08:31] addshore: strange, gerrit did not want to rebase from web interface [14:08:39] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3486264 (10herron) [14:08:39] 10Operations: ganeti2003 
ipmi_sdr_cache_create: internal IPMI error - https://phabricator.wikimedia.org/T172115#3486263 (10herron) [14:08:40] Thanks [14:09:01] 10Operations, 10Traffic: Improve OCSP fetching and monitoring strategies - https://phabricator.wikimedia.org/T172116#3486266 (10BBlack) [14:09:08] (03Merged) 10jenkins-bot: Turn on reading from the term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367393 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [14:09:21] (03CR) 10jenkins-bot: Turn on reading from the term_full_entity_id in testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367393 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [14:10:32] Amir1: 367393 is at mwdebug1002, please test and let me know if I can continue [14:11:13] addshore: any order in which I should deploy files from 363207? [14:11:19] or it does not matter' [14:11:20] ? [14:12:56] zeljkof: works fine [14:13:47] (03PS1) 10DCausse: Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) [14:14:25] zeljkof: *looks* [14:14:34] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:367393|Turn on reading from the term_full_entity_id in testwikidata (T165197)]] (duration: 00m 42s) [14:14:41] zeljkof: commonsettings first [14:14:45] Amir1: deployed, please check [14:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:48] T165197: Change configuration of test Wikidata to write term_full_entity_id - https://phabricator.wikimedia.org/T165197 [14:14:54] addshore: ok, will do [14:15:22] zeljkof: Works like a charm [14:15:22] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 (owner: 10Addshore) [14:16:55] (03Merged) 10jenkins-bot: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/363207 (owner: 10Addshore) [14:16:57] 10Operations, 10Traffic: Improve OCSP fetching and monitoring strategies - https://phabricator.wikimedia.org/T172116#3486297 (10BBlack) Hmm I wrote that backwards above. The OCSP file-freshness checks look at age-of-mtime, not the timestamp within. In any case, we can still move them to crit=~3d and warn=~2d. [14:17:08] (03CR) 10jenkins-bot: Remove wm?gRevisionSliderBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363207 (owner: 10Addshore) [14:17:16] (03PS1) 10Rush: openstack: rabbitmq diamond user is 'monitoring' [puppet] - 10https://gerrit.wikimedia.org/r/368778 (https://phabricator.wikimedia.org/T171494) [14:17:49] addshore: 363207 is at mwdebug [14:17:52] ack [14:18:02] (please check) [14:18:09] zeljkof: looks good [14:18:18] addshore: deploying [14:19:02] (03CR) 10Rush: [C: 032] openstack: rabbitmq diamond user is 'monitoring' [puppet] - 10https://gerrit.wikimedia.org/r/368778 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:19:03] PROBLEM - Host elastic1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:19:10] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:363207|Remove wm?gRevisionSliderBetaFeature]] (duration: 00m 42s) [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:01] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:363207|Remove wm?gRevisionSliderBetaFeature]] (duration: 00m 42s) [14:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:13] addshore: deployed, please check [14:20:30] Amir1: merging 366866 [14:20:32] looks good [14:20:51] addshore: in that case... thanks for releasing with #releng! ;) [14:20:57] Thanks!!! 
[14:21:04] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:21:27] (03CR) 10Volans: "@aklapper: did you verify that the recent users were all from those networks? Because as uploads/downloads from Zero are blocked, I guess " [puppet] - 10https://gerrit.wikimedia.org/r/368775 (owner: 10Aklapper) [14:22:54] (03PS3) 10Zfilipin: Add P279 to $wgPropertySuggesterClassifyingPropertyIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:23:32] PROBLEM - Host elastic1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:24:11] RECOVERY - Host elastic1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [14:24:47] (03CR) 10Zfilipin: Add P279 to $wgPropertySuggesterClassifyingPropertyIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:24:53] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:26:29] (03PS1) 10BBlack: OCSP: Warn less, retry more [puppet] - 10https://gerrit.wikimedia.org/r/368779 (https://phabricator.wikimedia.org/T172116) [14:26:31] (03Merged) 10jenkins-bot: Add P279 to $wgPropertySuggesterClassifyingPropertyIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:26:45] (03CR) 10jenkins-bot: Add P279 to $wgPropertySuggesterClassifyingPropertyIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [14:27:16] Amir1: 366866 is at mwdebug1002, please check and let me know if I can deploy [14:27:32] (03CR) 10jerkins-bot: [V: 04-1] OCSP: Warn less, retry more [puppet] - 
10https://gerrit.wikimedia.org/r/368779 (https://phabricator.wikimedia.org/T172116) (owner: 10BBlack) [14:27:37] okay [14:28:41] RECOVERY - Host elastic1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [14:28:41] PROBLEM - Host elastic1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:28:48] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Default to ext4 instead of ext3 - https://phabricator.wikimedia.org/T169605#3486344 (10fgiunchedi) [14:29:25] (03PS2) 10BBlack: OCSP: Warn less, retry more [puppet] - 10https://gerrit.wikimedia.org/r/368779 (https://phabricator.wikimedia.org/T172116) [14:30:32] Amir1: "okay" as in "it's ok, deploy"? or "I'm checking"? ;) [14:30:44] zeljkof: okay, I'm checking [14:30:49] this one is not easy to test [14:31:00] Amir1: take your time, just checking the status [14:31:14] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3486345 (10Addshore) [14:33:39] zeljkof: works fine! 
[14:33:51] RECOVERY - Host elastic1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.68 ms [14:33:57] Amir1: deploying [14:34:22] Thanks [14:34:58] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:366866|Add P279 to $wgPropertySuggesterClassifyingPropertyIds (T169060)]] (duration: 00m 42s) [14:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] T169060: Set $wgPropertySuggesterClassifyingPropertyIds to [ 31, 279 ] for wikidata.org - https://phabricator.wikimedia.org/T169060 [14:35:11] Amir1: deployed, please check [14:36:09] Thanks [14:36:20] !log un-banning and repooling elastic10(17|18|19) - T168816 [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:30] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [14:36:37] (03PS3) 10Andrew Bogott: get $::labsproject from the certname [puppet] - 10https://gerrit.wikimedia.org/r/368606 (https://phabricator.wikimedia.org/T171289) [14:36:40] It's great, thanks. [14:36:57] !log banning and repooling elastic10(20|21) - T168816 [14:37:01] Amir1: thanks for releasing with #releng! 
;) [14:37:02] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(17|18|19).eqiad.wmnet [14:37:05] !log EU SWAT finished [14:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] please don't forget to pick up all of your belongings [14:37:27] :D [14:37:33] ;) [14:38:04] please don't forget to leave all logs as clean as you have found them :) [14:39:14] :)))) That would work too [14:41:45] 10Operations, 10Wikidata, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3486378 (10matej_suchanek) [14:46:02] (03PS1) 10Ottomata: Sync published datasets more often, and allow users to rsync to speed up the process. [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) [14:47:01] (03CR) 10jerkins-bot: [V: 04-1] Sync published datasets more often, and allow users to rsync to speed up the process. [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:48:33] (03PS2) 10Ottomata: Sync published datasets more often, and allow users to rsync to speed up the process. 
[puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) [14:49:38] (03PS6) 10Gehel: Decrease elasticsearch search thread pool to 32 for cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:49:52] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [14:49:58] (03CR) 10Cmjohnson: [C: 032] Adding dns entries (mgmt and production) for labstore1006/7 public vlan T167984 [dns] - 10https://gerrit.wikimedia.org/r/368445 (owner: 10Cmjohnson) [14:50:14] (03CR) 10jerkins-bot: [V: 04-1] Sync published datasets more often, and allow users to rsync to speed up the process. [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:51:32] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hive-server2] [14:52:14] (03PS3) 10Ottomata: Sync published datasets more often, allow users to rsync [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) [14:55:44] (03PS4) 10Ottomata: Sync published datasets more often, allow users to rsync [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) [14:55:56] (03CR) 10Ottomata: [V: 032 C: 032] Sync published datasets more often, allow users to rsync [puppet] - 10https://gerrit.wikimedia.org/r/368794 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:56:29] (03CR) 10Gehel: [C: 032] Decrease elasticsearch search thread pool to 32 for cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:56:42] (03PS7) 10Gehel: Decrease elasticsearch search thread pool to 32 for cirrus servers [puppet] - 
10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:59:22] !log banning elastic10(22|23) - T168816 [14:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:34] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [15:00:17] (03PS4) 10Andrew Bogott: labs puppetmaster: validate cert name before autosigning [puppet] - 10https://gerrit.wikimedia.org/r/368449 (https://phabricator.wikimedia.org/T171961) [15:02:24] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: validate cert name before autosigning [puppet] - 10https://gerrit.wikimedia.org/r/368449 (https://phabricator.wikimedia.org/T171961) (owner: 10Andrew Bogott) [15:05:43] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3486463 (10Volker_E) The code provided by WordPress VIP has been merged into the repo on 20 Jun and got deployed shortly after. One would need to look into the FB redirec... 
[15:07:01] (03PS1) 10Marostegui: db-codfw.php: Add version to db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368801 [15:10:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add version to db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368801 (owner: 10Marostegui) [15:11:32] PROBLEM - Host ms-be2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:36] (03PS1) 10Rush: openstack: pull rabbitmq monitoring from own module [puppet] - 10https://gerrit.wikimedia.org/r/368802 (https://phabricator.wikimedia.org/T171494) [15:13:03] (03Merged) 10jenkins-bot: db-codfw.php: Add version to db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368801 (owner: 10Marostegui) [15:13:07] (03PS2) 10Rush: openstack: pull rabbitmq monitoring from own module [puppet] - 10https://gerrit.wikimedia.org/r/368802 (https://phabricator.wikimedia.org/T171494) [15:13:19] (03CR) 10jenkins-bot: db-codfw.php: Add version to db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368801 (owner: 10Marostegui) [15:13:52] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:14:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Specify mariadb running version on db2072 (duration: 00m 43s) [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:02] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [15:15:28] this is me --^ [15:15:29] fixing [15:16:02] RECOVERY - Hive Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [15:18:00] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#3486487 (10BBlack) [15:18:02] 10Operations, 10Pybal, 10Traffic: IPVS issues with UDP services, pybal 
depooling strategy - https://phabricator.wikimedia.org/T172103#3486486 (10BBlack) [15:18:04] (03PS1) 10Marostegui: mariadb: Add db2073 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/368804 (https://phabricator.wikimedia.org/T170662) [15:26:01] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3486510 (10Fjalapeno) [15:26:55] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3486510 (10Fjalapeno) [15:27:22] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3486510 (10Fjalapeno) [15:27:25] 10Operations, 10Pybal, 10Traffic: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124#3486529 (10BBlack) [15:27:50] 10Operations, 10Pybal, 10Traffic: Backport ipvsadm - https://phabricator.wikimedia.org/T171850#3486545 (10BBlack) [15:27:53] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#3486546 (10BBlack) [15:27:56] 10Operations, 10Pybal, 10Traffic: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124#3486529 (10BBlack) [15:28:27] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3486547 (10Fjalapeno) [15:30:56] 
10Operations, 10Pybal, 10Traffic: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124#3486555 (10BBlack) It's also an interesting thought to consider progressively scaling the weight. For example, you could make the strategy configurable such that... [15:31:43] 10Operations, 10Analytics, 10Analytics-Cluster, 10User-Elukey: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3486558 (10mforns) a:03Ottomata [15:32:13] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3480324 (10mforns) [15:33:25] (03PS1) 10Ottomata: Set maximum yarn vcore allocation to 32 [puppet] - 10https://gerrit.wikimedia.org/r/368806 (https://phabricator.wikimedia.org/T172018) [15:33:40] !log Create index on u2041__ores_p.monthly_wp10_enwiki - T146718 [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:50] T146718: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718 [15:34:17] (03CR) 10Thcipriani: [C: 031] Cassandra: Switch metrics-collector to use Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [15:34:48] (03CR) 10Thcipriani: [C: 031] "overall looks good, couple random nits inline" (032 comments) [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [15:37:23] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7236/analytics1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/368806 (https://phabricator.wikimedia.org/T172018) (owner: 10Ottomata) [15:39:15] (03CR) 10Ottomata: "This host has not been 'decom'ed, just shut down. Decom to come in the future, so for now don't merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/368612 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [15:41:27] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3486631 (10mforns) [15:41:29] (03PS2) 10Rush: Quarry: Add package 'python-xlsxwriter' [puppet] - 10https://gerrit.wikimedia.org/r/368597 (https://phabricator.wikimedia.org/T76126) (owner: 10Zhuyifei1999) [15:42:57] (03CR) 10Rush: [C: 032] Quarry: Add package 'python-xlsxwriter' [puppet] - 10https://gerrit.wikimedia.org/r/368597 (https://phabricator.wikimedia.org/T76126) (owner: 10Zhuyifei1999) [15:46:03] 10Operations, 10Operations-Software-Development, 10Pybal, 10Traffic, 10Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#3486672 (10ema) >>! In T134893#3486123, @BBlack wrote: > That it's happening often enough to re... 
[15:48:45] (03CR) 10Thcipriani: [C: 031] Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [15:48:53] (03CR) 10Thcipriani: [C: 031] Add the Scap3 configuration [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [15:52:57] 10Operations: Investigate check_nrpe -u option to reduce critical alerts - https://phabricator.wikimedia.org/T172131#3486701 (10herron) [15:54:18] 10Operations: Investigate check_nrpe -u option to reduce critical alerts - https://phabricator.wikimedia.org/T172131#3486714 (10herron) [15:54:38] (03CR) 10Ema: [C: 031] OCSP: Warn less, retry more [puppet] - 10https://gerrit.wikimedia.org/r/368779 (https://phabricator.wikimedia.org/T172116) (owner: 10BBlack) [15:54:44] (03PS2) 10DCausse: [cirrus] Tune ordering of crossproject search results on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368776 (https://phabricator.wikimedia.org/T171803) [16:04:02] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10Release-Engineering-Team (Next), 10Scap (Scap3-Adoption-Phase1): Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3486758 (10thcipriani) I made a couple of patches that at... 
[16:04:42] RECOVERY - Host ms-be2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms [16:10:51] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3486779 (10mforns) [16:15:37] !log depooling and shutting down elastic102[0123] for thermal paste - T168816 [16:15:44] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic10(20|21|22|23).eqiad.wmnet [16:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:55] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [16:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1034.eqiad.wmnet, elastic1025.eqiad.wmnet, elastic1042.eqiad.wmnet, elastic1018.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1026.eqiad.wmnet, elastic1019.eqiad.wmnet, elastic1039.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1050.eqiad.wmnet, elastic1032.eqiad.wmnet, elastic1027.eqiad.wmnet, [16:17:22] wmnet, elastic1040.eqiad.wmnet, elastic1038.eqiad.wmnet, elastic1033.eqiad.wmnet, elastic1035.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [16:18:22] PROBLEM - PyBal IPVS diff check on lvs1009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1034.eqiad.wmnet, elastic1025.eqiad.wmnet, elastic1042.eqiad.wmnet, elastic1044.eqiad.wmnet, elastic1029.eqiad.wmnet, elastic1039.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1037.eqiad.wmnet, elastic1032.eqiad.wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1049.eqiad.wmnet, [16:18:22] wmnet, elastic1038.eqiad.wmnet, 
elastic1035.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1045.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1024.eqiad.wmnet]) [16:18:27] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.038 second response time [16:18:28] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1040.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1049.eqiad.wmnet because of too many down! [16:18:33] nice I guess the alert works correctly [16:18:37] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.2.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.2.30, port=9200): Read timed out. (read timeout=4) [16:18:37] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic1040.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1037.eqiad.wmnet because of too many down! [16:18:37] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic1039.eqiad.wmnet, elastic1027.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic1031.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1042.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1046.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1043.eqiad.wmnet, elastic1051.eqiad.wmnet, elastic1045.eqiad.wmnet, [16:18:37] wmnet, elastic1038.eqiad.wmnet, elastic1033.eqiad.wmnet, elastic1035.eqiad.wmnet, elastic1037.eqiad.wmnet]) [16:18:37] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3486810 (10Cmjohnson) A ticket has been created with HPE. Your case was successfully submitted. 
Please note your Case ID: 5321778513 for future reference. [16:19:21] <_joe_> ema: pybal diffs are real? [16:19:26] hello! the wikimania organizers need help hosting some data on a server. That's all I know about what they need except that it also needs to be done quickly. Who is the best person for them to talk to about this? [16:19:29] <_joe_> that sounds so wrong [16:19:33] RECOVERY - PyBal IPVS diff check on lvs1009 is OK: OK: no difference between hosts in IPVS/PyBal [16:19:38] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 36, unassigned_shards: 1, number_of_pending_tasks: 439, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3154, task_max_waiting_in_queue_millis: 195932, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_numb [16:19:38] ctive_shards: 9478, initializing_shards: 0, number_of_data_nodes: 36, delayed_unassigned_shards: 0 [16:19:41] <_joe_> rfarrand: we have an outage going on, and the ops meeting [16:19:43] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.020 second response time [16:19:44] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:19:44] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:19:50] _joe_: I guess so, given the pages [16:19:53] <_joe_> rfarrand: I'd ask in ~ 1 hour [16:19:54] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:19:54] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [16:19:57] strange, all those hosts seem up according to icinga [16:20:34] _joe_ OK thank you [16:21:27] rfarrand: I guess joe was talking about the current alarms, not answering you ;) [16:21:28] ok,.
no idea what happened there, I did depool elastic102[0123], no idea why pybal seems to scream about others... [16:22:04] <_joe_> gehel: no idea tbh [16:22:05] ema: is etcd working on those pybals? [16:22:13] <_joe_> it is, it recovered [16:22:17] were the elastic depooled properly? [16:22:20] <_joe_> we have to figure out what happened [16:22:31] <_joe_> volans: later is ok, we have the meeting now [16:22:37] yeah I'm wondering about a race condition between etcd and pybal [16:23:00] scratch that, we check ipvs vs pybal [16:23:19] <_joe_> so when things change, it is very possible there is a moment of discrepancy [16:23:22] <_joe_> but not minutes [16:23:46] agree [16:24:15] yeah if the check is hitting pybal's http output vs ipvsadm, you'd think the lag there would be small [16:24:45] godog: later could you check prometheus there? given that we get ipvs data from prometheus... [16:24:57] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3486819 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi main board replacement complete. Tested power reset, power off and power on from CLI works with no problem. You can take over from now... [16:25:13] but then, we also (separately from the diff alert) got health check alerts (too many down) + general service down [16:25:17] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3486839 (10Cmjohnson) A ticket has been opened with HPE Your case was successfully submitted. Please note your Case ID: 5321778730 for future reference. [16:25:23] so this probably wasn't *only* a pybal issue?
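[editor's note] The "IPVS diff check" discussed above compares the hosts present in IPVS with the hosts PyBal knows about. The following is only a minimal sketch of that kind of comparison, not the actual Icinga check: the function names are hypothetical, the second CRITICAL message is invented for symmetry, and both host lists are assumed to have been fetched already (the log suggests PyBal's HTTP output on one side and ipvsadm or Prometheus data on the other).

```python
def ipvs_pybal_diff(ipvs_hosts, pybal_hosts):
    """Compare the two views and return (only_in_ipvs, only_in_pybal)."""
    ipvs = set(ipvs_hosts)
    pybal = set(pybal_hosts)
    return sorted(ipvs - pybal), sorted(pybal - ipvs)


def check_status(ipvs_hosts, pybal_hosts):
    """Build an Icinga-style status line from the two host lists."""
    only_in_ipvs, only_in_pybal = ipvs_pybal_diff(ipvs_hosts, pybal_hosts)
    if only_in_ipvs:
        # Wording modelled on the alert text seen in this log.
        return "CRITICAL: Hosts in IPVS but unknown to PyBal: " + ", ".join(only_in_ipvs)
    if only_in_pybal:
        # Hypothetical message; this case does not appear in the log.
        return "CRITICAL: Hosts known to PyBal but not in IPVS: " + ", ".join(only_in_pybal)
    return "OK: no difference between hosts in IPVS/PyBal"


if __name__ == "__main__":
    # Illustrative input only: PyBal has dropped a host that IPVS still lists.
    ipvs_view = ["elastic1024.eqiad.wmnet", "elastic1025.eqiad.wmnet"]
    pybal_view = ["elastic1024.eqiad.wmnet"]
    print(check_status(ipvs_view, pybal_view))
```

A check like this fires whenever the two views are sampled at different moments or PyBal marks a host down without removing it from IPVS, which is the kind of discrepancy being debated in the channel.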
[16:25:24] volans: for sure, I'll take a look after the meeting [16:26:00] "IPv4 on search.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503" <- would expect some problem other than pybal lag here [16:26:13] indeed [16:26:55] <_joe_> yes, but I want to understand why we had that alert [16:27:02] <_joe_> I guess we're checking the wrong thing? [16:27:17] <_joe_> or we were in a situation where we hit the depool threshold [16:27:30] <_joe_> and we don't expose that in the instrumentation except through alerts [16:27:31] (03PS13) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/368547 [16:27:44] (03PS3) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [16:27:45] _joe_: the last one seems the most plausible [16:28:09] if from pybal's side those are considered depooled but then pybal doesn't depool them in IPVS [16:28:21] we clearly get a false-alarm discrepancy [16:28:34] * gehel will look in the elasticsearch logs... but not expecting much... [16:30:51] !log mistaken restart of elastic1030 as part of T168816 [16:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:01] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [16:31:03] ^ might explain some of the noise, but not really all of it [16:31:21] gehel: what is the size of the cluster? [16:31:34] 36 nodes [16:32:05] and you depooled 4 + this one right? [16:32:56] Oh, 30 might have been the master at that time, but even that should not cause the cluster to answer 503... but it might ...
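[editor's note] The "depool threshold" hypothesis above can be illustrated with a small simulation. This is a sketch under stated assumptions, not PyBal's implementation: the rule used here (keep at least ceil(total * threshold) servers pooled), the 0.9 threshold, and the node names are all hypothetical, since the log does not give the configured value. Only the cluster size of 36 and the wording of the "too many down" message come from the log.

```python
import math


def min_pooled(total, threshold):
    """Smallest number of servers that must stay pooled (assumed rule)."""
    return math.ceil(total * threshold)


def try_depool(pooled, total, threshold, failing):
    """Process failing hosts in order; return (still_pooled, messages)."""
    pooled = set(pooled)
    messages = []
    floor = min_pooled(total, threshold)
    for host in failing:
        if host not in pooled:
            continue
        if len(pooled) - 1 >= floor:
            pooled.remove(host)
        else:
            # The host stays pooled even though its health check failed.
            messages.append(
                "Could not depool server %s because of too many down!" % host
            )
    return pooled, messages


if __name__ == "__main__":
    # 36 nodes as stated in the log; names and threshold are made up.
    hosts = ["node%02d" % n for n in range(1, 37)]
    still_pooled, refused = try_depool(hosts, len(hosts), 0.9, hosts[:6])
    print(len(still_pooled), "servers still pooled")
    for line in refused:
        print(line)
```

Under a rule like this, a host that fails its health check after the floor is reached stays in IPVS while PyBal regards it as down, which would be consistent with seeing both the "too many down" alerts and the IPVS/PyBal diff alerts at once.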
[16:36:04] Ok, the pybal check is not an actual search query, but a check of "/", which might ensure master has been re-elected before answering, will check [16:48:37] (03PS2) 10Jforrester: Enable OOjs UI EditPage on all wikis except Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366867 [16:54:31] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3486887 (10Halfak) a:03Halfak [16:57:12] 10Operations, 10Puppet, 10Traffic, 10Mobile, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#2904582 (10BBlack) It seems reasonable to relax the regex in question a bit (to allow additiona... [17:00:04] gehel: Respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T1700). Please do the needful. [17:01:15] (03PS2) 10ArielGlenn: write dump output files to temporary location, move in place when done [dumps] - 10https://gerrit.wikimedia.org/r/368744 (https://phabricator.wikimedia.org/T169849) [17:03:14] mutante: ping! 
:D [17:03:45] (03CR) 10Dzahn: [C: 04-1] "in jetty.pp: /var/lib/gerrit2/review_site/plugins/wikimedia_polygerrit_style.html" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [17:03:57] Amir1: omg, yes, sorry, HERE [17:04:06] joining [17:06:25] (03CR) 10Paladox: [C: 031] "> in jetty.pp: /var/lib/gerrit2/review_site/plugins/wikimedia_polygerrit_style.html" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [17:10:36] !log gehel@tin Started deploy [wdqs/wdqs@bdf3494]: (no justification provided) [17:10:45] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [17:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:14] (03PS1) 10BBlack: VCL mobile redirect: allow other params alongside title= [puppet] - 10https://gerrit.wikimedia.org/r/368814 (https://phabricator.wikimedia.org/T154227) [17:12:05] !log scb1001 restarted pdfrender service - T159922 [17:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [17:12:24] !log gehel@tin Finished deploy [wdqs/wdqs@bdf3494]: (no justification provided) (duration: 01m 48s) [17:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:36] SMalyshev: deployment of wdqs completed, tests are green [17:14:07] 10Operations, 10Puppet, 10Traffic, 10Mobile, and 3 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#3486936 (10Jdlrobson) Something like this maybe? ``` @@ -23,8 +23,13 @@ sub mobile_redirect {... 
[17:16:03] (03CR) 10Jdlrobson: [C: 04-1] VCL mobile redirect: allow other params alongside title= (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/368814 (https://phabricator.wikimedia.org/T154227) (owner: 10BBlack) [17:22:02] gehel: thanks! [17:22:26] SMalyshev: At you service! I'll be here all week, try the fish on Friday! [17:22:38] hehe :) [17:25:24] RECOVERY - swift-object-auditor on ms-be2024 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:25:25] RECOVERY - configured eth on ms-be2024 is OK: OK - interfaces up [17:25:25] RECOVERY - dhclient process on ms-be2024 is OK: PROCS OK: 0 processes with command name dhclient [17:25:25] RECOVERY - swift-account-server on ms-be2024 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:25:25] RECOVERY - salt-minion processes on ms-be2024 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:25:25] RECOVERY - MD RAID on ms-be2024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:25:28] !log un-banning and repooling elastic102[012] - T168816 [17:25:34] RECOVERY - swift-object-updater on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:25:34] RECOVERY - swift-account-auditor on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:25:34] RECOVERY - Host ms-be2024 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms [17:25:34] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(20|21|22).eqiad.wmnet [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:39] T168816: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816 [17:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:54] RECOVERY - very high load average likely xfs on 
ms-be2024 is OK: OK - load average: 29.10, 7.39, 2.48 [17:25:54] RECOVERY - Check size of conntrack table on ms-be2024 is OK: OK: nf_conntrack is 6 % full [17:25:54] RECOVERY - swift-object-server on ms-be2024 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:26:04] RECOVERY - SSH on ms-be2024 is OK: SSH OK - OpenSSH_7.4p1 Debian-10 (protocol 2.0) [17:26:04] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational [17:26:04] RECOVERY - swift-container-replicator on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:26:04] RECOVERY - swift-container-server on ms-be2024 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:26:05] RECOVERY - swift-account-reaper on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:26:05] RECOVERY - Disk space on ms-be2024 is OK: DISK OK [17:26:05] RECOVERY - DPKG on ms-be2024 is OK: All packages OK [17:26:05] RECOVERY - swift-account-replicator on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:26:05] RECOVERY - swift-container-auditor on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:26:06] RECOVERY - swift-container-updater on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:26:14] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2024 is OK: OK ferm input default policy is set [17:26:14] RECOVERY - swift-object-replicator on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:26:14] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. 
[17:26:15] that's me, machine is up again [17:28:14] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:36:01] (03CR) 10Dzahn: [C: 031] "oh ok.. well if you tested it and that's how i derives the path then ok, lgtm, afaict" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [17:36:14] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2024 is OK: OK: synced at Mon 2017-07-31 17:36:09 UTC. [17:44:41] (03CR) 10Paladox: [C: 031] "Note that this will not show until gerrit 2.15 but it's safe with gerrit 2.14 (won't break anything in that release) :)" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [17:47:44] (03PS1) 10Smalyshev: Enable archive search via Elastic everywhere except Wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368819 (https://phabricator.wikimedia.org/T163235) [17:48:37] (03CR) 10Paladox: [C: 031] "Note you can see this live at http://gerrit-new.wmflabs.org/r/?polygerrit=1" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [17:48:58] (03PS2) 10BBlack: VCL mobile redirect: allow other params alongside title= [puppet] - 10https://gerrit.wikimedia.org/r/368814 (https://phabricator.wikimedia.org/T154227) [17:50:07] (03PS1) 10Elukey: hive: remove etc default config in favor of hive-env.sh [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368820 (https://phabricator.wikimedia.org/T172107) [17:50:09] (03CR) 10Jdlrobson: [C: 031] "regex LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/368814 (https://phabricator.wikimedia.org/T154227) (owner: 10BBlack) [17:52:40] !log un-banning and repooling elastic1023 - T168816 [17:52:45] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic10(23).eqiad.wmnet [17:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:51] T168816: some elasticsearch servers in eqiad have CPU overheating - 
https://phabricator.wikimedia.org/T168816 [17:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:25] (03CR) 10Ottomata: hive: remove etc default config in favor of hive-env.sh (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368820 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [17:56:34] (03PS2) 10Elukey: hive: remove etc default config in favor of hive-env.sh [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368820 (https://phabricator.wikimedia.org/T172107) [17:57:10] (03CR) 10Ottomata: [C: 031] hive: remove etc default config in favor of hive-env.sh [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368820 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [17:57:20] (03CR) 10Elukey: hive: remove etc default config in favor of hive-env.sh (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/368820 (https://phabricator.wikimedia.org/T172107) (owner: 10Elukey) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T1800). [18:00:04] TabbyCat, bd808, and SMalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:39] here [18:00:51] o/ [18:02:06] I can SWAT [18:04:22] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3487120 (10Andrew) [18:04:26] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): instance root passwords vs. multiple puppetmasters - https://phabricator.wikimedia.org/T171959#3487118 (10Andrew) 05Open>03Resolved Due to various concerns we're going to just disable these passwords for now. 
[18:04:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368819 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [18:04:58] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487123 (10mmodell) @dzahn: anything I can do to help get this one moving? I tried to log in to phab1001 so that I could verify that puppet has... [18:06:27] (03Merged) 10jenkins-bot: Enable archive search via Elastic everywhere except Wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368819 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [18:06:36] (03CR) 10jenkins-bot: Enable archive search via Elastic everywhere except Wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368819 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [18:07:10] thcipriani: thanks! [18:07:12] bd808: is yours a wikitech/silver-only thing? Can it be tested on mwdebug1002, or just deploy? [18:09:46] SMalyshev: your change is live on mwdebug1002, check please [18:10:53] thcipriani: seems to be fine [18:11:01] SMalyshev: ok, going live everywhere [18:12:56] RainbowSprinkles: are you pruning branches on tin at the moment? 
[18:13:08] Oh snap, was just pruning wmf.10 [18:13:10] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.10 [keeping static files] (duration: 01m 18s) [18:13:14] Sorry, done [18:13:18] (didn't look at the time) [18:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:21] np :) [18:13:35] I was more worried scap left a lock file :P [18:14:17] (03PS14) 10Paladox: Gerrit: Add wmf branding to PolyGerrit [puppet] - 10https://gerrit.wikimedia.org/r/368547 [18:14:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368819|Enable archive search via Elastic everywhere except Wikidata]] T163235 (duration: 00m 42s) [18:14:56] (03CR) 10Dzahn: [C: 031] "http://gerrit-new.wmflabs.org/r/q/status:open looks better to me now with "WIKIMEDIA CODE REVIEW" as text and just the logo as image, yea" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [18:14:59] ^ SMalyshev should be live everwhere [18:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:01] T163235: Archive search deployment plan - https://phabricator.wikimedia.org/T163235 [18:15:49] (03PS2) 10Smalyshev: Cleanup old BC config for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) [18:16:22] (03CR) 10Chad: [C: 04-1] "I'm not sure I want to put this here. 2.15 is a long ways off for us (we haven't even moved to 2.14 yet). And with moving to scap deploys," [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [18:16:43] bd808: ping me when you're around for the SpecialNovaRole patch [18:16:52] (03PS3) 10Smalyshev: Cleanup old BC config for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) [18:17:24] thcipriani: ready when you are. 
there's really no good test for it other than make it live [18:17:35] okie doke, going live [18:17:50] sorry I missed your earlier ping :/ [18:18:36] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487147 (10Dzahn) @mmodell It doesn't have the puppet role for phab on it because we had to remove it. The role just isn't ready for being used... [18:19:12] no worries [18:19:38] !log thcipriani@tin Synchronized php-1.30.0-wmf.11/extensions/OpenStackManager/special/SpecialNovaRole.php: SWAT: [[gerrit:368596|Do not clobber $out in local scope]] T172077 (duration: 00m 42s) [18:19:44] ^ bd808 live now [18:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:47] T172077: removing a user from projectadmin on wikitech produces a blank page - https://phabricator.wikimedia.org/T172077 [18:20:26] thcipriani: thanks. I'll do a quick test. Probably can't be worse than the 500 response that it was doing before :) [18:20:26] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487154 (10mmodell) @dzahn: Ok, I can fix that. Thanks! I think there is a lot of room for improvement in the way we handle IP addresses. 
[18:21:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:22:02] (03PS3) 10Rush: openstack: pull rabbitmq monitoring from own module [puppet] - 10https://gerrit.wikimedia.org/r/368802 (https://phabricator.wikimedia.org/T171494) [18:23:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [18:25:05] (03CR) 10Paladox: "At the time i created this i thought it would work for 2.14 but digging and testing revealed this would only work on 2.15." [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [18:27:17] SMalyshev: hrm, do the CirrusSearch alerts ^ have anything to do with enabling archive search?/should I be worried about this? [18:28:20] thcipriani: Cirrus slowdown is probably related to the cluster still recovering after shutting down 4 nodes for maintenance [18:28:36] gehel: ah, ok, thanks :) [18:28:38] thcipriani: response times look like they are coming down again [18:28:38] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487196 (10mmodell) So what should we do instead of having host-specific IPs in `hieradata/role/[datacenter]/phabricator_server.yaml`? Should... [18:29:50] (03PS1) 10Ayounsi: Add codfw frack to Smokeping, Icinga and Rancid [puppet] - 10https://gerrit.wikimedia.org/r/368824 (https://phabricator.wikimedia.org/T171970) [18:30:06] thcipriani: the change I did should not increase latency, it only affects GUI and only for admins...
[18:30:18] (03CR) 10Rush: [C: 032] openstack: pull rabbitmq monitoring from own module [puppet] - 10https://gerrit.wikimedia.org/r/368802 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:30:41] so unless all admins suddenly decided to do millions of searches, should not be noticeable :) [18:30:48] :) [18:31:22] thanks, sorry for the noise. deploy + alert = me freaks out ;) [18:31:23] thcipriani: the OpenStackManager patch seems to work great. thanks for the deploy [18:31:49] yw :) [18:33:35] thcipriani: no i don't think so. It's an ongoing issue we are battling. Some thermal pasting of servers that are overheating and a reduction in thread pool sizes is happening this week to try and push back on it [18:34:19] * thcipriani *nods* [18:35:34] PROBLEM - Host cp4014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:34] PROBLEM - Host cp4017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:35] PROBLEM - Host cp4013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:35] PROBLEM - Host cp4016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:35] PROBLEM - Host cp4018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:35:35] PROBLEM - Host cp4015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:36:04] PROBLEM - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:37:14] PROBLEM - Host lvs4004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:37:14] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:37:15] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:37:15] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:37:21] !log we're migrating mr1-ulsfo, disregard mgmt icinga alerts [18:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:37] i was just too lazy to look up each one since icinga doesnt do wildcard in host matching strings via web gui [18:37:39] ;D [18:37:50] 10Operations, 
10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3487210 (10Dereckson) Flow Topic: translation will be live at 1.30.0-wmf.12. [18:38:04] PROBLEM - Host lvs4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:04] PROBLEM - Host lvs4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:04] PROBLEM - Host lvs4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:24] PROBLEM - Host cp4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:24] PROBLEM - Host cp4008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:25] PROBLEM - Host cp4010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:25] PROBLEM - Host cp4009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:39:54] PROBLEM - Host bast4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:40:34] RECOVERY - Host asw-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 79.11 ms [18:40:44] RECOVERY - Host cp4014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.11 ms [18:40:44] RECOVERY - Host cp4013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.62 ms [18:40:44] RECOVERY - Host cp4016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.92 ms [18:40:44] RECOVERY - Host cp4017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.02 ms [18:40:44] RECOVERY - Host cp4015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 90.86 ms [18:40:45] RECOVERY - Host cp4018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 91.66 ms [18:42:01] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487219 (10Paladox) @mmodell maybe a host specific hiera level [18:42:13] 10Operations, 10Phabricator, 10Release-Engineering-Team (Backlog): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3487220 (10mmodell) The latest change of plans is to set up 
`phab1001.eqiad.wmnet` before `phab2001.codfw.wmnet` as we can probably switch dir... [18:42:25] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.18 ms [18:42:25] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.37 ms [18:42:25] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.12 ms [18:43:14] RECOVERY - Host lvs4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.41 ms [18:43:14] RECOVERY - Host lvs4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.62 ms [18:43:14] RECOVERY - Host lvs4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.74 ms [18:43:15] RECOVERY - Host lvs4004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.08 ms [18:43:34] RECOVERY - Host cp4008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.43 ms [18:43:34] RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.14 ms [18:43:35] RECOVERY - Host cp4010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.12 ms [18:43:35] RECOVERY - Host cp4009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.11 ms [18:44:05] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487228 (10mmodell) [18:44:09] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3487227 (10mmodell) [18:45:04] RECOVERY - Host bast4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.15 ms [18:46:41] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3487236 (10mmodell) [18:50:25] gehel & ottomata: ping :) https://gerrit.wikimedia.org/r/#/c/367930/ [18:51:01] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and 
then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487265 (10Dzahn) role/[datacenter]/ seems actually correct and better than host names. [18:57:43] (03PS7) 10Ottomata: statistics::discovery: Reconfigure for Golden data retrieval [puppet] - 10https://gerrit.wikimedia.org/r/367930 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:57:52] (03CR) 10Ottomata: [V: 032 C: 032] statistics::discovery: Reconfigure for Golden data retrieval [puppet] - 10https://gerrit.wikimedia.org/r/367930 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:59:07] ottomata: thank you! :) [18:59:40] (03PS1) 10Ottomata: Include statistics::discovery just in private profile [puppet] - 10https://gerrit.wikimedia.org/r/368829 (https://phabricator.wikimedia.org/T170494) [18:59:42] ya bearloga just noticed this ^ merging that too and then running p [18:59:53] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487325 (10mmodell) @dzahn: Do we have an IP assigned for `git-ssh` on phab1001? [19:00:11] (03PS2) 10Ottomata: Include statistics::discovery just in private profile [puppet] - 10https://gerrit.wikimedia.org/r/368829 (https://phabricator.wikimedia.org/T170494) [19:00:36] (03CR) 10Ottomata: [V: 032 C: 032] Include statistics::discovery just in private profile [puppet] - 10https://gerrit.wikimedia.org/r/368829 (https://phabricator.wikimedia.org/T170494) (owner: 10Ottomata) [19:01:48] ottomata: ah, okie dokie. noted for future :) thanks! 
[19:02:54] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:02:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:04:38] (03Restored) 10Mforns: Add MediaWikiInstallPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) (owner: 10Mforns) [19:05:04] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:05:05] (03PS2) 10Mforns: Add MediaWikiInstallPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) [19:08:15] ha bearloga ::compute applies to common stat boxes [19:08:21] maybe i shoudl call that statistics::common :p [19:09:22] (03PS1) 10Ottomata: Remove re-declaration of 'wikidev' group in discovery class [puppet] - 10https://gerrit.wikimedia.org/r/368831 (https://phabricator.wikimedia.org/T170494) [19:09:36] (03CR) 10Ottomata: [V: 032 C: 032] Remove re-declaration of 'wikidev' group in discovery class [puppet] - 10https://gerrit.wikimedia.org/r/368831 (https://phabricator.wikimedia.org/T170494) (owner: 10Ottomata) [19:11:12] (03PS3) 10Mforns: Add MediaWikiInstallPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) [19:11:30] (03PS4) 10Mforns: Add MediaWikiPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) [19:11:46] bearloga: Notice: /Stage[main]/Statistics::Discovery/Cron[wikimedia-discovery-golden]/ensure: created [19:11:56] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:11:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% 
above the threshold [250.0] [19:12:16] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:17:50] (03PS5) 10Mforns: Add MediaWikiPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) [19:20:13] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487412 (10Dzahn) @mmodell Here's the thing. There is the git-ssh IP for eqiad 208.80.154.250 and git-ssh for codfw 208.80.153.250. This IP... [19:21:52] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3487427 (10mmodell) Yeah I think scheduled downtime to switch the IP is reasonable. I'll make a patch and we can do it this week if you're up f... [19:40:29] (03Draft1) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 [19:40:32] (03PS2) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 [19:40:49] twentyafterfour and mutante did i do ^^ that correctly? [19:41:51] paladox: looks ok but maybe we want to use variables for the host names [19:41:56] ok [19:42:06] i wonder will this not cause failures on labs? [19:43:35] paladox: it's correct that it's in the profile.. but yea.. 
variables [19:43:41] ok [19:43:50] paladox: no, because we put the values in Hiera and NOT in puppet code [19:43:57] ah [19:43:58] that is how we also avoid labs failure [19:43:58] ok [19:44:09] because then you can just set it to the right value in Hiera: wiki [19:44:20] but then an empty hiera value will cause it to complain about missing variable [19:44:24] we can set it to null [19:44:28] set it in hieradata/common.yaml [19:44:30] ok [19:44:31] yep [19:44:35] next to "install_server" "netmon_server" etc [19:44:38] you will see it [19:44:57] fyi, I'm rebooting pfw3-codfw for upgrade, it should not generate any alarms, (but might if I forgot to downtime something) [19:45:10] ok [19:47:07] (03PS3) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 [19:47:08] mutante done [19:47:09] brb [19:48:06] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:48:23] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3487493 (10Fjalapeno) [19:48:27] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 104, down: 1, dormant: 0, excluded: 3, unused: 0 [19:48:52] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3487493 (10Fjalapeno) [19:48:57] paladox: nitpick, just drop the "_maint" part. was it "main"? 
either way, just "phabricator_server" is enough [19:51:06] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [19:56:02] 10Operations, 10Epic, 10Goal, 10Services (doing): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3487530 (10Eevans) 05stalled>03Open [19:59:36] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 105, down: 0, dormant: 0, excluded: 3, unused: 0 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T2000). Please do the needful. [20:00:27] Nothing for ORES today [20:03:15] no mobileapps deploy today [20:03:45] parsoid deploy today, whee! [20:06:04] when was the last time the ssh key on tin was changed? SSH is warning me about host id, but I haven't done a deploy in a while. [20:06:26] The fingerprint for the ECDSA key sent by the remote host (tin.eqiad.wmnet) is SHA256:jZhHHpPiAspcYnKiJIo+h380CoMBpaBSS5Bw03mMCTs. [20:09:14] cscott: that's correct https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/tin.eqiad.wmnet [20:09:28] I just discovered that (very helpful) URL, thanks. [20:12:00] (03CR) 10Mobrovac: [C: 031] JobQueueEventBus: Enable job events in group0 wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [20:22:08] !log cscott@tin Started deploy [parsoid/deploy@c1cba48]: Updating Parsoid to 08114f35 [20:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:11] (03CR) 10Nuria: [C: 031] Add MediaWikiPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) (owner: 10Mforns) [20:32:58] !log cscott@tin Finished deploy [parsoid/deploy@c1cba48]: Updating Parsoid to 08114f35 (duration: 10m 50s) [20:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:25] !log Updated Parsoid to version 08114f35 (T43716, T154718, T166413) [20:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:38] T43716: Support language variant conversion in Parsoid - https://phabricator.wikimedia.org/T43716 [20:33:38] T166413: Link with text beginning with a bracket not parsed correctly - https://phabricator.wikimedia.org/T166413 [20:33:38] T154718: On serialising, if a parameter alias is used Parsoid should use its main item's paramOrder - https://phabricator.wikimedia.org/T154718 [20:41:15] mutante back [20:43:38] (03PS4) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 [20:46:01] (03PS1) 10Cmjohnson: Removing old mgmt dns entries for decom'd hosts, added 1 mgmt entry for cp1008 [dns] - 10https://gerrit.wikimedia.org/r/368896 [20:46:56] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [20:47:46] ^ ugh, ok.. 
looking, we know that one [20:47:55] paladox: yep:) [20:47:57] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3487579 (10MarcoAurelio) > This solution to disable Flow isn't acceptable. I totally disagree here. If the translation has been provided th... [20:48:11] :) [20:48:59] !log restarting pdfrender service on sc1001 after icinga alert (T159922) [20:49:08] scb1002 what am i doing [20:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:11] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [20:49:14] fixing on wiki [20:49:47] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3487583 (10Dzahn) ^ that was scb1002 - not sc1001 - typo [20:49:56] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time [20:50:55] (03PS7) 10Thcipriani: CI/integration: Create role for docker CI agent [puppet] - 10https://gerrit.wikimedia.org/r/365416 (https://phabricator.wikimedia.org/T150502) [20:52:14] (03CR) 10Paladox: "This can be reverted on pabricator's main date which is thursday 1am utc +0" [puppet] - 10https://gerrit.wikimedia.org/r/368841 (owner: 10Paladox) [20:57:21] 10Operations, 10Epic, 10Goal, 10Services (doing): Consider a lower virutal node count - https://phabricator.wikimedia.org/T172149#3487606 (10Eevans) [20:59:59] (03PS11) 10Urbanecm: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. 
Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T2100). [21:01:43] (03PS12) 10Urbanecm: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) [21:06:46] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3487635 (10mmodell) a:03mmodell [21:07:57] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3487640 (10mmodell) [21:08:25] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839436 (10mmodell) [21:09:26] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:12:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:12:27] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:19:50] ema, bblack, is that cp1099 mailbox lag a known issue? [21:22:12] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3481349 (10ayounsi) @GoranSMilovanovic What do you exactly need access to? And could you have your manager to sign off on this request? 
Thank you [21:22:53] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3487757 (10ayounsi) a:03ayounsi [21:23:13] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3481349 (10ayounsi) p:05Triage>03Normal [21:24:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:26:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:35:52] (03PS1) 10Rush: openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) [21:36:49] (03CR) 10jerkins-bot: [V: 04-1] openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:37:17] (03PS1) 10MarcoAurelio: Allow bureaucrats on WMF wikis to grant and remove 'confirmed' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983) [21:37:36] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:38:24] (03PS2) 10Rush: openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) [21:45:34] (03CR) 10Urbanecm: [C: 031] "Regarding codebase, LGTM. Don't know if it consensus is okay right now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983) (owner: 10MarcoAurelio) [21:50:08] (03CR) 10MarcoAurelio: "> Regarding codebase, LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983) (owner: 10MarcoAurelio) [21:52:23] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3487861 (10Tbayer) [21:55:40] (03CR) 10MarcoAurelio: [C: 031] Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: 10Urbanecm) [21:57:37] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:04] (03PS3) 10Rush: openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) [21:58:19] (03PS5) 10Urbanecm: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) [21:59:00] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3487863 (10GoranSMilovanovic) @Addshore What exactly do I need access to - could you please provide an answer to this question since I still haven't seen th... 
[22:29:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0 [22:29:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [22:32:48] !log mobrovac@tin Started deploy [citoid/deploy@7ad598d]: Do not wait for PubMed requests to complete - T162886 [22:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:00] T162886: Parallelise pubmed requests to get IDs earlier in the request chain and skip it when the DOIs are scraped from the page (rarer occurrence.) - https://phabricator.wikimedia.org/T162886 [22:38:40] !log mobrovac@tin Finished deploy [citoid/deploy@7ad598d]: Do not wait for PubMed requests to complete - T162886 (duration: 05m 52s) [22:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:50] T162886: Parallelise pubmed requests to get IDs earlier in the request chain and skip it when the DOIs are scraped from the page (rarer occurrence.) - https://phabricator.wikimedia.org/T162886 [22:45:57] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3487965 (10mobrovac) >>! In T133178#3482880, @GWicke wrote: > This sounds reasonable to me. Any objections against going with www.wikimedia.or... 
[22:46:34] (03PS1) 10Dzahn: phabricator/admins: give phab admins access to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) [22:46:48] (03CR) 10jerkins-bot: [V: 04-1] phabricator/admins: give phab admins access to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:47:17] (03PS2) 10Dzahn: phabricator/admins: give phab admins access to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) [22:48:42] (03CR) 10Paladox: [C: 031] phabricator/admins: give phab admins access to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:48:49] (03CR) 10Dzahn: "@20after4 also see the other 2 things we are already setting here based on hostname. We want to remember removing/flipping that when we ma" [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:49:15] jouncebot: next [22:49:16] In 0 hour(s) and 10 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T2300) [22:51:04] (03CR) 10Dzahn: [C: 032] phabricator/admins: give phab admins access to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368947 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [22:52:47] paladox: we need the reverse of the review bot , heh [22:52:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [22:52:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 [22:52:57] lol [22:53:20] paladox: i mean.. 
it detecs who to add as reviewer based on filenames and pathes, right [22:53:40] that is define by some magic page on mediawiki [22:53:40] it could do the same thing but influence which channel wikibugs talks to [22:53:53] like "if phab module is touched, then tell -releng" [22:54:00] mutante: https://www.mediawiki.org/wiki/Git/Reviewers [22:54:05] yea, i know that page [22:54:21] there should be another one that maps files to channels , heh [22:54:39] puppet module->irc channel [22:54:45] oh yep [22:57:33] (03PS5) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) [22:57:40] (03PS6) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) [22:58:43] (03CR) 10Rush: [C: 032] openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:58:51] (03PS4) 10Rush: openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) [22:59:46] (03CR) 10Rush: openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [23:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170731T2300). Please do the needful. [23:00:06] Smalyshev and TabbyCat: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:12] here [23:00:16] o/ [23:00:19] (03CR) 10Rush: [C: 032] openstack: move novaobserver to a profile [puppet] - 10https://gerrit.wikimedia.org/r/368938 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [23:00:54] (03PS7) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) [23:02:56] (03PS8) 10Dzahn: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) (owner: 10Paladox) [23:03:11] I can SWAT [23:04:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:05:48] <3 thcipriani [23:05:57] (03Merged) 10jenkins-bot: Cleanup old BC config for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:06:27] (03CR) 10jenkins-bot: Cleanup old BC config for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:07:16] (03PS9) 10Paladox: phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) [23:08:31] SMalyshev: looks like this should be a noop, live on mwdebug1002 if there's anything to test [23:08:46] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:08:51] thcipriani: correct, this should be a noop, I'll check if wikidata doesn't malfuncion [23:09:46] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [23:09:59] (03PS5) 10Thcipriani: Make ptwikimedia a fishbowl wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367892 (https://phabricator.wikimedia.org/T171501) (owner: 10MarcoAurelio) [23:10:04] thcipriani: yep, everything looks normal [23:10:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367892 (https://phabricator.wikimedia.org/T171501) (owner: 10MarcoAurelio) [23:10:13] SMalyshev: ok, going live [23:10:22] * TabbyCat preparese to check ptwikimedia [23:11:10] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:366488|Cleanup old BC config for JsonUnitStorage]] T171107 (duration: 00m 42s) [23:11:15] ^ SMalyshev live everywhere [23:11:20] thcipriani: thanks! [23:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:21] T171107: Class undefined: \Wikibase\Lib\JsonUnitStorage - https://phabricator.wikimedia.org/T171107 [23:11:28] thanks for the cleanup :) [23:11:39] (03Merged) 10jenkins-bot: Make ptwikimedia a fishbowl wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367892 (https://phabricator.wikimedia.org/T171501) (owner: 10MarcoAurelio) [23:11:52] (03CR) 10jenkins-bot: Make ptwikimedia a fishbowl wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367892 (https://phabricator.wikimedia.org/T171501) (owner: 10MarcoAurelio) [23:13:00] TabbyCat: Make ptwikimedia a fishbowl wiki live on mwdebug1002, check please [23:13:29] thcipriani: on mwdebug1002 ptwikimedia links to create accounts, etc are not publicy avalaible as expected [23:14:29] TabbyCat: ok, going live [23:14:34] ty [23:15:15] ptwikimedia used Flow? 
[23:15:30] (flow-computed.dblist = all.dblist - nonflow.dblist - private.dblist - fishbowl.dblist + flowprivate.dblist) [23:15:51] so if so, we need to add it to flowprivate.dblist [23:16:03] Dereckson: it gave me jenkins error if I didn't added it to that list [23:16:24] + they've not used it so I don't think we need to add it [23:17:04] flowprivate is not very intuitive :/ [23:17:24] fishbowl != private in the sense of not publicy viewable [23:17:48] !log thcipriani@tin Synchronized dblists: SWAT: [[gerrit:367892|Make ptwikimedia a fishbowl wiki]] T171501 (duration: 00m 43s) [23:17:56] ^ TabbyCat live now [23:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:58] a flowfishbowl.dblist becomes a little overkill [23:17:59] T171501: Disable accountcreation on ptwikimedia - https://phabricator.wikimedia.org/T171501 [23:18:41] (03PS3) 10Thcipriani: Path for enwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368233 (https://phabricator.wikimedia.org/T171810) (owner: 10MarcoAurelio) [23:18:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368233 (https://phabricator.wikimedia.org/T171810) (owner: 10MarcoAurelio) [23:18:56] thcipriani: why my password isn't working there now? :( [23:19:35] I had an account before fishbowling it... I guess the credentials should continue working normally right? [23:19:58] Dereckson: yep, a flowfishbowl would be overkill indeed [23:20:02] fishbowl wikis don't use SUL [23:20:18] so I just broke my account? [23:20:19] don't use CentralAuth sorry [23:20:23] yup [23:20:31] f*ck [23:20:46] well, who is bureaucrat on this wiki? 
[23:21:01] Can use createAndPromote.php to promote a user to b-crat [23:21:19] it's easy to fix: recreate the account for the bureaucrat, and hop they'll do the remaining accounts, as RainbowSprinkles said [23:21:36] alchimista and waldir [23:21:50] (03Merged) 10jenkins-bot: Path for enwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368233 (https://phabricator.wikimedia.org/T171810) (owner: 10MarcoAurelio) [23:22:00] (03CR) 10jenkins-bot: Path for enwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368233 (https://phabricator.wikimedia.org/T171810) (owner: 10MarcoAurelio) [23:22:05] but if my account is broken (and I cannot even reset my password), how can we know if they'll be able to log in too? [23:22:20] (03PS1) 10Dereckson: Reenable Flow on ptwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368950 [23:22:22] maybe we should revert? this was totally unexpected [23:23:02] TabbyCat: each account has to be recreated [23:23:12] Dereckson: and if we revert? 
[23:23:31] that will be CentralAuth accounts again [23:23:39] I'd move back to the status quo [23:23:45] +1 [23:23:50] I think you'll want a larger window to handle the account migration from SUL -> private [23:24:02] * Dereckson nods [23:24:04] sure, I can revert until all ducks are in a row [23:24:14] this was totally unexpected - I thought accounts with its credentials would stay [23:24:38] thcipriani: so, if you think it's okay, I'd like to proceed with enwikiquote logo but revert ptwikimedia [23:24:53] (03PS1) 10Dereckson: Revert "Make ptwikimedia a fishbowl wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368951 [23:24:54] * Dereckson prepares the revert [23:24:57] here you are ^ [23:24:57] after we take care of the logo [23:25:02] thank you Dereckson [23:25:31] (03PS1) 10Thcipriani: Revert "Make ptwikimedia a fishbowl wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368952 [23:25:56] TabbyCat: sure, both the logo and the revert are live on mwdebug1002, check please [23:27:00] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3488092 (10EdErhart-WMF) @Volker_E Facebook's debugging tool allows you to see what their scraper sees on our blog. ([[ https://developers.facebook.com/tools/debug/echo/?... 
[23:27:26] thcipriani: on it [23:27:32] thcipriani: if you take mine (368951) I've put a note why we're reverting [23:27:47] Dereckson: ack, thanks [23:28:15] thcipriani: both look good to me [23:28:16] (03Abandoned) 10Thcipriani: Revert "Make ptwikimedia a fishbowl wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368952 (owner: 10Thcipriani) [23:28:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368951 (owner: 10Dereckson) [23:29:36] (03Abandoned) 10Dereckson: Reenable Flow on ptwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368950 (owner: 10Dereckson) [23:29:44] TabbyCat: ok going live with logo change first [23:30:20] (03Merged) 10jenkins-bot: Revert "Make ptwikimedia a fishbowl wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368951 (owner: 10Dereckson) [23:30:33] (03CR) 10jenkins-bot: Revert "Make ptwikimedia a fishbowl wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368951 (owner: 10Dereckson) [23:30:39] TabbyCat: https://gerrit.wikimedia.org/r/#/c/368950/ so tests pass for "ptwikimedia in flow.dblist, flowprivate.dblist and fishbowl.dblist" [23:31:26] operations-mw-config-composer-hhvm-jessie FAILURE in 1m 17s [23:31:34] https://gerrit.wikimedia.org/r/#/c/367892/ [23:32:04] 13:53:19 1) DbListTests::testComputedListsFreshness [23:32:04] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:368233|Path for enwikiquote logo]] T171810 (duration: 00m 43s) [23:32:06] 13:53:19 Contents of 'flow' must match expansion of 'flow-computed' [23:32:07] 13:53:19 Failed asserting that two arrays are equal. 
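Note on the test failure quoted above: the check compares flow.dblist with the expansion of the flow-computed expression given at 23:15:30 (all.dblist - nonflow.dblist - private.dblist - fishbowl.dblist + flowprivate.dblist). The following is a minimal Python sketch of that set arithmetic, assuming the terms are applied left to right. The function name and every wiki name other than ptwikimedia are hypothetical, the list contents are made up for illustration, and this is not the actual test code in mediawiki-config.

```python
def expand_flow_computed(dblists):
    """Expand all - nonflow - private - fishbowl + flowprivate (left to right).

    `dblists` maps a list name to a set of wiki database names.
    Assumption: "-" is set difference and "+" is set union.
    """
    result = set(dblists["all"])
    for name in ("nonflow", "private", "fishbowl"):
        result -= set(dblists.get(name, ()))
    result |= set(dblists.get("flowprivate", ()))
    return result


# Hypothetical list contents, for illustration only.
example = {
    "all": {"enwiki", "ptwikimedia", "officewiki", "somewiki"},
    "nonflow": {"somewiki"},
    "private": {"officewiki"},
    "fishbowl": {"ptwikimedia"},
    "flowprivate": set(),
}

# A fishbowl wiki drops out of the computed list...
without_exception = expand_flow_computed(example)
print(sorted(without_exception))  # ['enwiki']

# ...unless it is also listed in flowprivate.
example["flowprivate"] = {"ptwikimedia"}
with_exception = expand_flow_computed(example)
print(sorted(with_exception))  # ['enwiki', 'ptwikimedia']
```

Under these assumptions, adding a wiki to fishbowl.dblist removes it from the computed list unless it is also in flowprivate.dblist, so flow.dblist has to be changed in the same patch for the two to keep matching.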
[23:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:15] T171810: Path for en.wikiquote.org uploaded logo is missing in InitialiseSettings.php - https://phabricator.wikimedia.org/T171810 [23:32:26] TabbyCat: yep flow and flow-computed must match [23:32:27] I guess I missed 'flowprivate.dblist' there [23:32:44] TabbyCat: path for enwikiquote logo live now, doing revert now [23:32:50] flow-computed provides all.dblist - nonflow.dblist - private.dblist - fishbowl.dblist + flowprivate.dblist [23:33:02] so I need two entries? one for flow.dblist + flowprivate.dblist? [23:33:07] so fishbowl are excluded, except if it's in flowprivate.dblist [23:33:19] ah, ok [23:33:31] thcipriani: ack, sorry for the issues [23:34:21] no worries :) [23:34:44] !log thcipriani@tin Synchronized dblists: SWAT: [[gerrit:368951|Revert "Make ptwikimedia a fishbowl wiki"]] T171501 (duration: 00m 42s) [23:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:54] T171501: Disable accountcreation on ptwikimedia - https://phabricator.wikimedia.org/T171501 [23:34:56] ^ TabbyCat Dereckson revert is now live [23:35:06] I guess once we move out ptwikimedia from fishbowl they start using the centralauth table again [23:35:15] checking on production [23:35:45] works for me thci [23:35:48] thcipriani: [23:36:24] cool, thanks for checking [23:38:59] (CR) Dzahn: [C: 2] phabricator: rsync /srv/repos on iridium to phab1001 [puppet] - https://gerrit.wikimedia.org/r/368841 (https://phabricator.wikimedia.org/T163938) (owner: Paladox) [23:39:16] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2069789 [23:40:30] (Draft1) Paladox: phabricator: Remove old rsync file [puppet] - https://gerrit.wikimedia.org/r/368956 [23:40:34] (PS2) Paladox: phabricator: Remove old rsync file [puppet] - https://gerrit.wikimedia.org/r/368956 [23:42:03] Dereckson: RainbowSprinkles -- 
https://phabricator.wikimedia.org/T172160 [23:42:16] maybe you'd like to comment on the parent task as well? [23:49:41] (PS1) Dzahn: phab1001: add interface::add_ip6_mapped [puppet] - https://gerrit.wikimedia.org/r/368957 (https://phabricator.wikimedia.org/T137928) [23:49:50] maybe someone would also like to take care of https://phabricator.wikimedia.org/T47746 ? [23:50:11] renameUserCleanup.php ? [23:50:54] (CR) Paladox: [C: 1] phab1001: add interface::add_ip6_mapped [puppet] - https://gerrit.wikimedia.org/r/368957 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn) [23:52:18] TabbyCat: There's not enough information to run the job [23:53:00] Reedy: what do you need, former and current name? [23:53:06] Yeah [23:53:22] I don't have access to collabwiki but maybe Trizek does? [23:53:34] I can probably do it via sql [23:53:45] heh [23:54:33] Or maybe not [23:57:49] I've asked Jamesofur, he may be able to assist [23:58:02] * TabbyCat time to sleep [23:58:05] (CR) Paladox: [C: 1] phab1001: add interface::add_ip6_mapped (1 comment) [puppet] - https://gerrit.wikimedia.org/r/368957 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn) [23:58:35] looking [23:59:13] BJones (WMF) [23:59:37] The user row is fine [23:59:48] so Bjones to BJones_(WMF)
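For reference, the flow-computed rule quoted at 23:32:50 (all.dblist - nonflow.dblist - private.dblist - fishbowl.dblist + flowprivate.dblist) can be sketched as plain set arithmetic. This is only an illustration under stated assumptions: each dblist is treated as an unordered set of database names, the function names and every wiki name except ptwikimedia are hypothetical, and it is not the actual mediawiki-config code behind DbListTests::testComputedListsFreshness.

```python
def expand_flow_computed(all_wikis, nonflow, private, fishbowl, flowprivate):
    """Sketch of: all - nonflow - private - fishbowl + flowprivate.

    Each argument is an iterable of database names (assumed to be sets here).
    """
    base = set(all_wikis) - set(nonflow) - set(private) - set(fishbowl)
    return base | set(flowprivate)


def freshness_mismatch(flow, computed):
    """Return the names present in only one of the two lists, sorted."""
    return sorted(set(flow) ^ set(computed))


# Hypothetical list contents; only ptwikimedia comes from the log.
all_wikis = {"enwiki", "mediawikiwiki", "officewiki", "ptwikimedia"}
nonflow = {"enwiki"}
private = {"officewiki"}
fishbowl = {"ptwikimedia"}          # ptwikimedia made a fishbowl wiki
flow = {"mediawikiwiki", "ptwikimedia"}

# Without a flowprivate entry, the fishbowl wiki drops out of the expansion,
# so the stored 'flow' list no longer matches it.
stale = expand_flow_computed(all_wikis, nonflow, private, fishbowl, set())
print(freshness_mismatch(flow, stale))

# Listing the wiki in flowprivate adds it back and the two lists agree.
fresh = expand_flow_computed(all_wikis, nonflow, private, fishbowl, {"ptwikimedia"})
print(freshness_mismatch(flow, fresh))
```

With these made-up lists, the first check reports ptwikimedia as the mismatch and the second reports none, which matches the explanation that fishbowl wikis are excluded unless they also appear in flowprivate.dblist.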