[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T0000).
[00:02:22] (PS1) Dzahn: netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504)
[00:02:58] (CR) jerkins-bot: [V: -1] netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504) (owner: Dzahn)
[00:03:39] (PS2) Dzahn: netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504)
[00:05:50] (PS1) Krinkle: mtail: Escape the '.' in /w/load.php for varnishrls.mtail [puppet] - https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479)
[00:07:00] (PS3) Dzahn: netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504)
[00:07:05] (CR) Krinkle: "This is an improvement as-is, but I'd like to prevent /w/load.phpFOO from matching as well. Basically only if the url stops at '.php', or " [puppet] - https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) (owner: Krinkle)
[00:07:53] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Thank you for the electrifying IRC meeting today! I owe a few actionables to the group, includin...
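The review comment at 00:07:05 separates two matching problems for /w/load.php: an unescaped '.' matches any character, and a pattern with no end condition also matches /w/load.phpFOO. A minimal Python sketch of that distinction follows. The pattern names are hypothetical and the real expression in varnishrls.mtail is not shown in this log. The `(?:\?|$)` ending is an assumption, because the comment is cut off after "or".

```python
import re

# Hypothetical patterns for illustration only; not copied from varnishrls.mtail.
UNESCAPED = re.compile(r"^/w/load.php")          # '.' matches any single character
ESCAPED = re.compile(r"^/w/load\.php")           # literal dot, but nothing stops a longer path
ANCHORED = re.compile(r"^/w/load\.php(?:\?|$)")  # assumed ending: end of URL or start of a query string


def matches(pattern, url):
    """Return True if the compiled pattern matches the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("/w/load.php", "/w/load.php?modules=site", "/w/loadXphp", "/w/load.phpFOO"):
        print(url, matches(UNESCAPED, url), matches(ESCAPED, url), matches(ANCHORED, url))
```

Only the third pattern rejects both /w/loadXphp and /w/load.phpFOO while still accepting the bare URL and the URL with a query string.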
[00:08:08] !log preparing phabricator update
[00:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:14] (PS4) Dzahn: netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504)
[00:12:59] Operations, DBA, JADE, Scoring-platform-team (Current), User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) p:Triage>Normal
[00:13:20] (CR) jerkins-bot: [V: -1] netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504) (owner: Dzahn)
[00:17:01] Operations, DBA, JADE, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Part of completing these schemas will be to look at what MCR can do for the schema. It already s...
[00:19:41] (PS5) Dzahn: netbox: fix psql client auth for monitoring check [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504)
[00:23:04] (CR) Dzahn: [C: 2] "tested by manually running icinga plugin and editing pg_hba.conf" [puppet] - https://gerrit.wikimedia.org/r/454723 (https://phabricator.wikimedia.org/T185504) (owner: Dzahn)
[00:31:17] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504 (Dzahn) Fixed after the change above. on netmon1002 (master), pg_hba.conf ``` +host netbox replication 2620:0:860:4:208:80:153:110/128 md5 ``` on netmon2001 (slave), nagios...
[00:33:02] Operations, Goal: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083 (Dzahn)
[00:33:06] Operations, Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634 (Dzahn)
[00:33:09] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504 (Dzahn) Open>Resolved
[00:36:36] !log phabricator will be down momentarily for apache restart, downtime scheduled in icinga
[00:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:28] (PS1) Dzahn: postgresql::slave::monitoring: make check description configurable [puppet] - https://gerrit.wikimedia.org/r/454730 (https://phabricator.wikimedia.org/T185504)
[01:09:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:14:23] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[02:23:01] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 09m 10s)
[02:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:24] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[02:56:03] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.18) (duration: 13m 27s)
[02:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:34] PROBLEM - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdj1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdj1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops
[03:06:25] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Aug 23 03:06:25 UTC 2018 (duration 10m 22s)
[03:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:53] (PS1) Krinkle: Document meaning and origin of 'cluster' in mc.php [mediawiki-config] - https://gerrit.wikimedia.org/r/454740
[03:13:04] (CR) jerkins-bot: [V: -1] Document meaning and origin of 'cluster' in mc.php [mediawiki-config] - https://gerrit.wikimedia.org/r/454740 (owner: Krinkle)
[03:13:55] (CR) Krinkle: "Wasn't clear to me at first, but eventually found a mention of this string somewhere in Puppet.
Documented based on what I found and infer" [mediawiki-config] - https://gerrit.wikimedia.org/r/454740 (owner: Krinkle)
[03:15:18] (PS2) Krinkle: Document meaning and origin of 'cluster' in mc.php [mediawiki-config] - https://gerrit.wikimedia.org/r/454740
[03:15:44] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:27:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:29:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:37:00] Operations, Wikimedia-Mailing-lists, Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (Shizhao) thx @Wong128hk
[04:12:41] (PS1) Tim Starling: pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741
[04:13:21] (CR) jerkins-bot: [V: -1] pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741 (owner: Tim Starling)
[04:16:16] (PS2) Tim Starling: pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741
[04:16:43] (CR) jerkins-bot: [V: -1] pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741 (owner: Tim Starling)
[04:28:46] (CR) Zhuyifei1999: "What is the difference among 'include', 'require', and a 'class {' declaration?"
[puppet] - https://gerrit.wikimedia.org/r/454715 (owner: Dzahn)
[04:31:35] (PS3) Tim Starling: pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741
[04:40:01] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454742
[04:41:54] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454742 (owner: Marostegui)
[04:43:08] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454742 (owner: Marostegui)
[04:44:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3316 (duration: 00m 57s)
[04:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:01] (PS1) Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/454743
[04:49:02] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1096:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454742 (owner: Marostegui)
[04:50:04] (PS2) Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454743
[04:52:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454743 (owner: Marostegui)
[04:53:13] (Merged) jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454743 (owner: Marostegui)
[04:55:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 (duration: 00m 55s)
[04:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:33] (PS4) Tim Starling: pt-heartbeat class for Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/454741
[05:04:52] (CR) jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] -
https://gerrit.wikimedia.org/r/454743 (owner: Marostegui)
[05:19:25] !log Drop blob_orphans and blob_tracking from s5 T59186
[05:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:30] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[05:40:01] (PS1) KartikMistry: Add LingoCloud MT config [puppet] - https://gerrit.wikimedia.org/r/454745 (https://phabricator.wikimedia.org/T202604)
[05:41:47] (CR) KartikMistry: "recheck" [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:46:59] !log Drop blob_orphans and blob_tracking from s6 T59186
[05:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:04] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[05:50:24] !log Drop blob_orphans and blob_tracking from s4 T59186
[05:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:18] (CR) jerkins-bot: [V: -1] hfst: Sync package from Debian [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/450900 (https://phabricator.wikimedia.org/T199962) (owner: KartikMistry)
[05:57:49] !log Drop blob_orphans and blob_tracking from s2 T59186
[05:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:54] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[06:11:13] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454746
[06:13:55] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454746 (owner: Marostegui)
[06:14:23] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold
[5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:15:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454746 (owner: Marostegui)
[06:15:57] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (akosiaris)
[06:16:00] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): jessie support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T201942 (akosiaris)
[06:16:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:16:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3316 (duration: 00m 55s)
[06:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:15] (PS1) Marostegui: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454748
[06:19:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454748 (owner: Marostegui)
[06:20:10] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (akosiaris) This has already been mentioned in T199125 where it was bypassed by using the 1G cards instead. However a different issue turned out there with support for the Perc H740P RAID con...
[06:20:50] (Merged) jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454748 (owner: Marostegui)
[06:22:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113:3316 (duration: 00m 54s)
[06:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1098:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454746 (owner: Marostegui)
[06:24:41] (CR) jenkins-bot: db-eqiad.php: Depool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/454748 (owner: Marostegui)
[06:29:14] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:29:43] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apache-status]
[06:31:11] !log Enable semi-sync on es2 - T202364
[06:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:16] T202364: Switchover es2 master (es1011) to es1015 - https://phabricator.wikimedia.org/T202364
[06:32:24] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures.
Failed resources (up to 3 shown): File[/usr/local/sbin/reboot-host],File[/usr/local/sbin/enforce-users-groups]
[06:47:51] !log Drop blob_orphans and blob_tracking from s7 T59186
[06:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:56] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[06:57:34] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:33] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:54] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:12:03] !log Drop blob_orphans and blob_tracking from s1 T59186
[07:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:09] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[07:19:33] (CR) Muehlenhoff: "As mentioned in https://phabricator.wikimedia.org/T197791#4460655 this needs an access request ticket."
[puppet] - https://gerrit.wikimedia.org/r/448505 (owner: Aklapper)
[07:19:52] (PS4) Muehlenhoff: Remove enable_microcode logic [puppet] - https://gerrit.wikimedia.org/r/454203
[07:23:38] !log Drop blob_orphans and blob_tracking from s3 on codfw (lag will be generated on codfw) T59186
[07:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:43] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186
[07:30:02] !log installing openssl updates
[07:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:43] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (thiemowmde)
[07:34:21] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454759
[07:35:57] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454759 (owner: Marostegui)
[07:37:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454759 (owner: Marostegui)
[07:37:11] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (WMDE-Fisch) >>! In T202475#4525265, @RobH wrote: > * Please review and sign the L3 document. > ** Provide a public...
[07:37:23] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (thiemowmde) I struggle a bit with the SSH key request, as I would like to reduce the complexity to manage all this...
[07:39:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 (duration: 00m 56s)
[07:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:46] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/454760
[07:40:20] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (jcrespo) @thiemowmde No, it is explicitly said above: "This ssh key pair should only be used for WMF cluster acces...
[07:41:43] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/454760 (owner: Marostegui)
[07:42:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/454760 (owner: Marostegui)
[07:43:00] jouncebot: next
[07:43:00] In 0 hour(s) and 16 minute(s): Wikidata cache labels initial config (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T0800)
[07:44:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 (duration: 00m 55s)
[07:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:47] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/454759 (owner: Marostegui)
[07:44:49] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/454760 (owner: Marostegui)
[07:56:10] (PS1) Muehlenhoff: Drop obsolete removal of timidity-daemon [puppet] - https://gerrit.wikimedia.org/r/454761
[07:56:12] (PS2) Addshore: Wikidata: Added config variable to change new link formatter item range [mediawiki-config] - https://gerrit.wikimedia.org/r/454510 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[07:56:18]
(PS3) Addshore: Wikidata: Use new item ID formatter for Q1 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[07:56:54] PROBLEM - MariaDB Slave SQL: s6 on db1125 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 2 of table frwiki.externallinks cannot be converted from type tinyblob to type int(11)
[07:57:05] ^ checking that
[07:57:15] Ah right
[07:57:16] I know why
[08:00:04] addshore: How many deployers does it take to do Wikidata cache labels initial config deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T0800).
[08:00:05] addshore, Aleksey_WMDE, and Aleksey_WMDE: A patch you scheduled for Wikidata cache labels initial config is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:46] (PS4) Addshore: testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:06:28] (CR) Addshore: [C: 2] Wikidata: Added config variable to change new link formatter item range [mediawiki-config] - https://gerrit.wikimedia.org/r/454510 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:10:23] la la la
[08:10:26] (Merged) jenkins-bot: Wikidata: Added config variable to change new link formatter item range [mediawiki-config] - https://gerrit.wikimedia.org/r/454510 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:11:23] (CR) Giuseppe Lavagetto: [C: 1] Drop obsolete removal of timidity-daemon [puppet] - https://gerrit.wikimedia.org/r/454761 (owner: Muehlenhoff)
[08:12:33] (PS2) Jcrespo: mariadb backups: Capture connection error exceptions [puppet] - https://gerrit.wikimedia.org/r/454509 (https://phabricator.wikimedia.org/T198987)
[08:12:55] (CR)
Giuseppe Lavagetto: Document meaning and origin of 'cluster' in mc.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/454740 (owner: Krinkle)
[08:14:20] <_joe_> onimisionipe: welcome aboard! :)
[08:14:43] Hi Giuseppe..
[08:14:46] Thank you!
[08:16:16] onimisionipe: welcome as well! (Moritz, also from SRE)
[08:16:38] (CR) jenkins-bot: Wikidata: Added config variable to change new link formatter item range [mediawiki-config] - https://gerrit.wikimedia.org/r/454510 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:16:42] onimisionipe: Welcome! Manuel here (DBA)
[08:16:56] Thank you Moritz
[08:17:04] Thanks Manuel!
[08:17:08] (PS1) Vogone: This change adds the permission 'editcontentmodel' to the 'massmessage-senders' user group on metawiki. The change will allow 'massmessage-senders' on metawiki to use Special:CreateMassMessageList. [mediawiki-config] - https://gerrit.wikimedia.org/r/454762 (https://phabricator.wikimedia.org/T202597)
[08:17:13] <_joe_> hey I thought I was late in cheering him as I'm on/off in the last few days :P
[08:17:14] welcome! (Luca, analytics :)
[08:17:26] lol
[08:17:36] Thanks Luca!
[08:19:24] welcome onimisionipe! (Valentín, SRE/Traffic)
[08:19:58] Thanks Valentin!..
[08:20:59] (PS5) Muehlenhoff: Remove enable_microcode logic [puppet] - https://gerrit.wikimedia.org/r/454203 (https://phabricator.wikimedia.org/T127825)
[08:21:08] (PS2) KartikMistry: Add LingoCloud MT config [puppet] - https://gerrit.wikimedia.org/r/454745 (https://phabricator.wikimedia.org/T202604)
[08:21:37] (CR) Mathew.onipe: "I am a reviewer!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/446801 (owner: DCausse)
[08:21:50] (CR) Ema: [C: 1] Remove enable_microcode logic [puppet] - https://gerrit.wikimedia.org/r/454203 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[08:24:25] (PS1) Addshore: Add log channel Wikibase.NewItemIdFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/454764 (https://phabricator.wikimedia.org/T201832)
[08:24:42] (CR) Addshore: [C: 2] Add log channel Wikibase.NewItemIdFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/454764 (https://phabricator.wikimedia.org/T201832) (owner: Addshore)
[08:24:55] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/454765
[08:26:16] (Merged) jenkins-bot: Add log channel Wikibase.NewItemIdFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/454764 (https://phabricator.wikimedia.org/T201832) (owner: Addshore)
[08:28:12] (CR) Addshore: [C: 2] testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:29:47] (PS5) Addshore: testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:29:51] (CR) Addshore: [C: 2] testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:31:08] (Merged) jenkins-bot: testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:32:36] (PS2) Vogone: This change adds the permission 'editcontentmodel' to the 'massmessage-senders' user group on metawiki.
The change will allow 'massmessage-senders' on metawiki to use Special:CreateMassMessageList. [mediawiki-config] - https://gerrit.wikimedia.org/r/454762 (https://phabricator.wikimedia.org/T202597)
[08:32:42] (CR) jenkins-bot: Add log channel Wikibase.NewItemIdFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/454764 (https://phabricator.wikimedia.org/T201832) (owner: Addshore)
[08:32:44] (CR) jenkins-bot: testwikidata: Use new item ID formatter for Q1-100 [mediawiki-config] - https://gerrit.wikimedia.org/r/452365 (https://phabricator.wikimedia.org/T201832) (owner: WMDE-leszek)
[08:34:54] (PS1) Ema: prometheus: add trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381)
[08:36:29] (PS3) Vogone: Add 'editcontentmodel' permission to 'massmessage-senders' on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/454762 (https://phabricator.wikimedia.org/T202597)
[08:37:24] RECOVERY - MariaDB Slave SQL: s6 on db1125 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:37:30] (PS2) Aklapper: Phab: Allow aklapper to delete personal Herald filter rules [puppet] - https://gerrit.wikimedia.org/r/448505 (https://phabricator.wikimedia.org/T202503)
[08:37:46] (PS2) Ema: prometheus: add trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381)
[08:39:31] addshore: are you done? Can I deploy?
[08:39:35] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T201832 Stuff for new link formatter on testwikidata [[gerrit:454510]] [[gerrit:452365]] [[gerrit:454764]] (duration: 00m 56s)
[08:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:40] T201832: Use link formatter that uses cache instead of wb_terms for item Q1 - https://phabricator.wikimedia.org/T201832
[08:41:17] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: T201832 Stuff for new link formatter on testwikidata [[gerrit:454510]] (duration: 00m 55s)
[08:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:30] !log done with deploy window
[08:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:38] <_joe_> win 18
[08:42:50] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/454765 (owner: Marostegui)
[08:42:54] collect 200
[08:43:08] <_joe_> addshore: that looks like a deal
[08:43:11] sorry marostegui just saw your ping, yes :)
[08:43:17] :)
[08:43:20] Thanks!
[08:44:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/454765 (owner: Marostegui)
[08:44:12] (CR) Ema: "pcc is pleased https://puppet-compiler.wmflabs.org/compiler02/12183/" [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) (owner: Ema)
[08:45:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085 (duration: 00m 55s)
[08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:00] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/454767
[08:51:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1093 (duration: 00m 55s)
[08:51:28] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/454765 (owner: Marostegui)
[08:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/454767 (owner: Marostegui)
[08:56:23] (PS5) Giuseppe Lavagetto: PHP: create module for modern Debian-based distributions [puppet] - https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140)
[08:56:25] (PS5) Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140)
[08:56:37] (PS7) Giuseppe Lavagetto: php: add service management for php-fpm [puppet] - https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140)
[08:56:41] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/454767 (owner: Marostegui)
[08:56:45] (CR) Giuseppe Lavagetto: [C: 2] PHP: create module for modern Debian-based distributions
[puppet] - https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) (owner: Giuseppe Lavagetto)
[08:58:04] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (thiemowmde) * I don't know what "WMCS" stands for, despite working with Wikimedia infrastructure for about a decade...
[09:01:46] onimisionipe: welcome! Arturo Borrero here from the Cloud Services team, working from Spain :-)
[09:05:07] arturo: Thank You!
[09:10:34] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (Gehel) >>! In T202476#4525961, @thiemowmde wrote: > * I don't know what "WMCS" stands for, despite working with Wik...
[09:15:51] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (MoritzMuehlenhoff) >>! In T202476#4525961, @thiemowmde wrote: > * I don't know what "WMCS" stands for, despite work...
[09:16:02] (CR) Muehlenhoff: "Looks good, two suggestions." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) (owner: Ema)
[09:16:03] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (dcausse) @thiemowmde managing multiple ssh keys can be really easy once you setup it properly, please see https://w...
[09:16:44] (CR) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/454767 (owner: Marostegui) [09:16:47] (PS3) Ema: prometheus: add trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) [09:17:11] !log starting to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/444610 as part of T198351, including regeneration of SSL certs. Disabling puppet on elastic* during the operation [09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:16] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351 [09:17:26] (PS1) Elukey: profile::archiva: limit rsync access to DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/454770 [09:18:22] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (Addshore) >>! In T202476#4525961, @thiemowmde wrote: > * L3 states to use different keys for "production" and "labs... 
[09:19:33] (CR) Muehlenhoff: [C: 1] profile::archiva: limit rsync access to DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/454770 (owner: Elukey) [09:19:56] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/454771 [09:19:59] (PS7) Vgutierrez: Certcentral integration tests [software/certcentral] - https://gerrit.wikimedia.org/r/454045 [09:20:45] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) (owner: Ema) [09:21:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/454771 (owner: Marostegui) [09:22:38] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/454771 (owner: Marostegui) [09:22:42] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (Joe) >>! In T202476#4525961, @thiemowmde wrote: > * I don't know what "WMCS" stands for, despite working with Wikim... [09:23:17] (PS4) Ema: prometheus: add trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) [09:23:37] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/454773 [09:23:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1093 (duration: 00m 56s) [09:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:18] (CR) Elukey: "Ottomata: is it intended to be configured in this way or should we change it?" 
[puppet] - https://gerrit.wikimedia.org/r/454770 (owner: Elukey) [09:25:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/454773 (owner: Marostegui) [09:25:55] (CR) Ema: [C: 2] prometheus: add trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454766 (https://phabricator.wikimedia.org/T202381) (owner: Ema) [09:26:49] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/454773 (owner: Marostegui) [09:27:00] (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1: move nova DBs to m5-master [puppet] - https://gerrit.wikimedia.org/r/454774 (https://phabricator.wikimedia.org/T202549) [09:28:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 (duration: 00m 55s) [09:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:33] Operations, Traffic, Patch-For-Review: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (ema) [09:29:39] !log installing nodejs security updates on wtp* (Parsoid servers) [09:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:51] (PS21) Gehel: Switch elasticsearch to use tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson) [09:35:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/454771 (owner: Marostegui) [09:35:51] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/454773 (owner: Marostegui) [09:35:57] (CR) ArielGlenn: [C: 1] "Indeed no host currently has this package in production." 
[puppet] - https://gerrit.wikimedia.org/r/454761 (owner: Muehlenhoff) [09:38:06] (CR) ArielGlenn: [C: 2] tar up dumps status files for rsync for each back end in turn [puppet] - https://gerrit.wikimedia.org/r/454549 (https://phabricator.wikimedia.org/T202482) (owner: ArielGlenn) [09:39:44] (PS22) Gehel: Switch elasticsearch to use tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson) [09:40:45] !log restarting etherpad-lite for nodejs security update [09:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] (CR) Gehel: [C: 2] Switch elasticsearch to use tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson) [09:44:33] (PS2) Arturo Borrero Gonzalez: cloudvps: eqiad1: move nova DBs to m5-master [puppet] - https://gerrit.wikimedia.org/r/454774 (https://phabricator.wikimedia.org/T202549) [09:45:59] (PS1) Gehel: elasticsearch: new SSL cert for relforge [puppet] - https://gerrit.wikimedia.org/r/454778 (https://phabricator.wikimedia.org/T198351) [09:46:01] (PS3) Arturo Borrero Gonzalez: cloudvps: eqiad1: move nova DBs to m5-master [puppet] - https://gerrit.wikimedia.org/r/454774 (https://phabricator.wikimedia.org/T202549) [09:46:17] (CR) Gehel: [C: 2] elasticsearch: new SSL cert for relforge [puppet] - https://gerrit.wikimedia.org/r/454778 (https://phabricator.wikimedia.org/T198351) (owner: Gehel) [09:50:34] !log installing debdeploy update [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:25] (PS4) Arturo Borrero Gonzalez: cloudvps: eqiad1: move nova DBs to m5-master [puppet] - https://gerrit.wikimedia.org/r/454774 (https://phabricator.wikimedia.org/T202549) [09:55:27] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/454779 [09:57:46] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/454779 (owner: Marostegui) [09:59:02] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/454779 (owner: Marostegui) [09:59:36] !log Deploy schema change on s6 primary master (db1061) [09:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 (duration: 00m 54s) [10:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:17] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [10:04:37] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [10:05:07] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [10:05:16] mmmm [10:05:17] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [10:05:18] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a 
response was received [10:05:25] joal: --^ [10:05:25] (PS1) Gehel: elasticsearch: new SSL cert for search.svc.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/454780 (https://phabricator.wikimedia.org/T198351) [10:05:27] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received [10:05:40] this seems something related to druid? [10:06:10] <_joe_> elukey: need help? [10:06:22] (CR) Gehel: [C: 2] elasticsearch: new SSL cert for search.svc.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/454780 (https://phabricator.wikimedia.org/T198351) (owner: Gehel) [10:06:44] (PS1) Alexandros Kosiaris: Renumber analytics-tool* hosts to analytics vlan [dns] - https://gerrit.wikimedia.org/r/454781 (https://phabricator.wikimedia.org/T202559) [10:06:45] not really atm, it seems only related to edit metrics (they are mostly used by Wikistats beta) [10:06:48] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [10:07:04] this doesn't sound nice [10:07:28] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled [10:07:38] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/454779 (owner: Marostegui) [10:07:45] ok checking the druid cluster [10:07:57] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:08:58] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) 
[10:09:07] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) [10:10:33] <_joe_> elukey: something we can do? [10:11:28] nono it is probably related to what we have done this morning, the impact is minimal and related to an (almost beta) site, so nothing terrible atm. I am trying to check what happened from the logs now [10:11:29] elukey: I just regenerated SSL certs for search.svc.codfw.wmnet, I can't see how this would be related, apart from timing correlation [10:12:18] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [10:12:27] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [10:12:27] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [10:12:27] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [10:12:29] gehel: ack thanks for that! [10:12:35] I just restarted one broker on druid1004 [10:12:38] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [10:12:38] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [10:12:42] there were jvm allocation failures [10:12:45] lovely [10:12:54] sorry, GC (Allocation Failure) [10:12:58] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1291 bytes in 0.001 second response time [10:12:58] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [10:13:07] for young gen apparently [10:13:17] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [10:13:55] elukey: allocation failures in themselves are not an issue, just an indication that GC has to run [10:14:07] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [10:14:08] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal 
[10:14:34] elukey: though of course, that could be an indication of not enough space in young gen [10:15:11] yeah, a bit weird.. I'll dig a bit into it (restart first, then deep check :) [10:15:45] elukey: if you can take a few thread dumps before restart that could help with investigation [10:15:58] !log restart druid-broker on druid100[4-5] due to unresponsiveness (still unclear why) [10:15:59] elukey: https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Further_analysis [10:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:37] good point thanks! [10:24:31] Operations, Wikimedia-Mailing-lists, Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (MarcoAurelio) >>! In T202435#4523278, @Wong128hk wrote: > * The requested name of the mailing list is wikizh-bureaucrats@lists.wikimedia.org > * The purp... [10:27:35] !log purge rec dns caches for analytics-tool* hosts T202559 [10:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:40] T202559: Allow ganeti instance inside of the Analytics VLAN; move analytics-tool* to it and change IPs. - https://phabricator.wikimedia.org/T202559 [10:27:52] !log reboot analytics-tool* hosts for the IP renumbering change T202559 [10:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:28] PROBLEM - Host analytics-tool1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:35:57] !log installing openldap updates from stretch 9.5 point release [10:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:36] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2010.codfw.wmnet', 'cp4030.ulsfo.wmnet'] ``` The log can be found in `/var/l... 
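Annotation on the 10:12 to 10:15 exchange about "GC (Allocation Failure)": the point made there is that an allocation failure is only the trigger for a young-generation collection, so what matters is how often those collections run and how long they pause. A minimal sketch of how that could be checked from a GC log follows. It assumes log lines in the Java 8 `-XX:+PrintGCDetails` style; the sample lines, the regular expression and the function name `young_gc_summary` are illustrative assumptions, not taken from the druid brokers.

```python
import re

# Hypothetical sample lines in the Java 8 -XX:+PrintGCDetails style. They are
# invented for illustration and do not come from druid1004/druid1005.
SAMPLE_LOG = """\
12.100: [GC (Allocation Failure) [PSYoungGen: 65536K->8192K(76288K)] 65536K->20480K(251392K), 0.0150000 secs]
12.900: [GC (Allocation Failure) [PSYoungGen: 73728K->8192K(76288K)] 86016K->40960K(251392K), 0.0200000 secs]
13.400: [Full GC (Ergonomics) [PSYoungGen: 8192K->0K(76288K)] 180000K->90000K(251392K), 0.4000000 secs]
"""

# Matches only young collections triggered by an allocation failure and
# captures the JVM uptime stamp and the pause duration.
YOUNG_GC = re.compile(
    r"^(?P<uptime>\d+\.\d+): \[GC \(Allocation Failure\).*?, (?P<pause>\d+\.\d+) secs\]"
)


def young_gc_summary(log_text):
    """Return (count, total_pause_seconds, mean_interval_seconds)."""
    uptimes, pauses = [], []
    for line in log_text.splitlines():
        match = YOUNG_GC.match(line)
        if match:
            uptimes.append(float(match.group("uptime")))
            pauses.append(float(match.group("pause")))
    count = len(pauses)
    # The mean interval needs at least two collections to be defined.
    if count > 1:
        mean_interval = (uptimes[-1] - uptimes[0]) / (count - 1)
    else:
        mean_interval = None
    return count, sum(pauses), mean_interval


if __name__ == "__main__":
    print(young_gc_summary(SAMPLE_LOG))
```

A short mean interval combined with long pauses would point toward the "not enough space in young gen" hypothesis raised at 10:14:34; infrequent, short collections would point away from it.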
[10:40:04] (PS1) Ema: prometheus: Job definition for trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454784 [10:40:24] (CR) jerkins-bot: [V: -1] prometheus: Job definition for trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454784 (owner: Ema) [10:40:39] (CR) Elukey: [C: 1] Renumber analytics-tool* hosts to analytics vlan [dns] - https://gerrit.wikimedia.org/r/454781 (https://phabricator.wikimedia.org/T202559) (owner: Alexandros Kosiaris) [10:40:41] (PS1) Jonas Kress (WMDE): Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/454785 [10:40:43] (CR) jerkins-bot: [V: -1] Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/454785 (owner: Jonas Kress (WMDE)) [10:40:52] (CR) Alexandros Kosiaris: [C: 2] Renumber analytics-tool* hosts to analytics vlan [dns] - https://gerrit.wikimedia.org/r/454781 (https://phabricator.wikimedia.org/T202559) (owner: Alexandros Kosiaris) [10:40:54] (PS2) Ema: prometheus: Job definition for trafficserver_exporter [puppet] - https://gerrit.wikimedia.org/r/454784 (https://phabricator.wikimedia.org/T202381) [10:41:12] (PS8) Vgutierrez: Certcentral integration tests [software/certcentral] - https://gerrit.wikimedia.org/r/454045 [10:41:14] (PS9) Vgutierrez: Certcentral integration tests [software/certcentral] - https://gerrit.wikimedia.org/r/454045 [10:41:16] (PS1) Gehel: elasticsearch: new SSL cert for search.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/454788 (https://phabricator.wikimedia.org/T198351) [10:41:18] (CR) Gehel: [C: 2] elasticsearch: new SSL cert for search.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/454788 (https://phabricator.wikimedia.org/T198351) (owner: Gehel) [10:41:21] (CR) Ema: "pcc looks good to me: https://puppet-compiler.wmflabs.org/compiler03/12194/" [puppet] - 
https://gerrit.wikimedia.org/r/454784 (https://phabricator.wikimedia.org/T202381) (owner: Ema) [10:42:23] Operations, Analytics, Analytics-Kanban, Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (akosiaris) [10:42:54] PROBLEM - Elasticsearch HTTPS on elastic1042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:42:54] Operations: Integrate stretch 9.4 point update - https://phabricator.wikimedia.org/T189435 (MoritzMuehlenhoff) Open>Resolved This has been rolled out fleet-wide for a while. [10:43:53] (PS2) Muehlenhoff: Drop obsolete removal of timidity-daemon [puppet] - https://gerrit.wikimedia.org/r/454761 [10:44:15] (CR) Alexandros Kosiaris: [C: 1] "Let's syncup to deploy this" [puppet] - https://gerrit.wikimedia.org/r/454574 (owner: Ppchelko) [10:46:05] (CR) Muehlenhoff: [C: 2] Drop obsolete removal of timidity-daemon [puppet] - https://gerrit.wikimedia.org/r/454761 (owner: Muehlenhoff) [10:53:03] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2010_v4, cp2010_v6 [10:54:13] Operations, Legalpad: Update terms "Labs" and "Operations" in L3 - https://phabricator.wikimedia.org/T202617 (Aklapper) [10:54:49] Operations, SRE-Access-Requests, wikidiff2, User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (Aklapper) >>! In T202476#4526037, @Joe wrote: > About the L3 document: it needs to be amended, I'll ping @RobH late... 
[10:55:33] RECOVERY - Elasticsearch HTTPS on elastic1042 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1824 days) [10:56:19] RECOVERY - Host analytics-tool1002 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [10:56:48] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4030_v4, cp4030_v6 [10:56:54] !log Deploy schema change on s2 codfw masters (db2035) with replication - this will generate lag on s2 codfw [10:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:19] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 32 ESP OK [10:57:39] ACKNOWLEDGEMENT - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4030_v4, cp4030_v6 Ema reimaging [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1100). [11:00:04] addshore: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:19] oh! 
[11:00:29] !log new SSL certs / tlsproxy deployed on elastic nodes - T198351 [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:34] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351 [11:00:54] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 52 ESP OK [11:01:48] !log Deploy schema change on dbstore1002:s2 [11:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:37] (PS1) Vgutierrez: Deliver certificates in every save mode [software/certcentral] - https://gerrit.wikimedia.org/r/454794 [11:07:51] (CR) jerkins-bot: [V: -1] Deliver certificates in every save mode [software/certcentral] - https://gerrit.wikimedia.org/r/454794 (owner: Vgutierrez) [11:08:20] (PS2) Addshore: Enable moved paragrah detection everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:08:25] (CR) Addshore: [C: 2] Enable moved paragrah detection everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:08:47] (CR) Thiemo Kreuz (WMDE): [C: 1] Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:09:34] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2010.codfw.wmnet', 'cp4030.ulsfo.wmnet'] ``` and were **ALL** successful. 
[11:09:55] (Merged) jenkins-bot: Enable moved paragrah detection everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:12:28] (CR) jenkins-bot: Enable moved paragrah detection everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:12:30] (PS2) Vgutierrez: Deliver certificates in every save mode [software/certcentral] - https://gerrit.wikimedia.org/r/454794 [11:13:46] (CR) jerkins-bot: [V: -1] Deliver certificates in every save mode [software/certcentral] - https://gerrit.wikimedia.org/r/454794 (owner: Vgutierrez) [11:15:52] (PS3) Vgutierrez: Deliver certificates in every save mode [software/certcentral] - https://gerrit.wikimedia.org/r/454794 [11:18:53] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: T199800 [[gerrit:454026|Enable moved paragrah detection everywhere]] (duration: 00m 55s) [11:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:58] T199800: Enabling moved paragraphs everywhere and for mobile diffs - https://phabricator.wikimedia.org/T199800 [11:19:16] !log installing shared-mime-info updates from stretch 9.5 point release [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:46] (CR) Addshore: [C: 2] Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:23:05] (PS1) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/454796 [11:26:24] addshore: do you have time for a couple of patches if I add them to the calendar? [11:26:37] Hauskatze: are they easy? 
[11:26:47] yes, a permission change for wikidata [11:26:53] still only 1/2 way through the 2 on the list (had a slow start) [11:26:57] Hauskatze: link? :) [11:27:03] sure, let me fetch [11:27:23] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/454330/ addshore [11:27:48] GroupsAddToSelf [11:28:28] and https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/454590/, super easy as well [11:28:43] (CR) Addshore: [C: 2] Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: MarcoAurelio) [11:28:48] (PS3) Addshore: Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: MarcoAurelio) [11:28:52] (CR) Addshore: [C: 2] Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: MarcoAurelio) [11:30:26] (Merged) jenkins-bot: Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: MarcoAurelio) [11:32:11] Hauskatze: hmm, how should I test it, users in the wikidata-staff group should be able to add the interface admin group to people? or just to themselves? [11:32:25] only self? [11:32:37] addshore: pull it to mwdebug and I'll check Special:ListGroupRights for you :) [11:32:45] already done [11:32:51] 1 or 2? [11:32:54] (sorry, I should have said)! 
[11:32:58] mwdebug1002 [11:33:01] checking [11:33:55] addshore: lgtm, they can now only add I-A to their own accounts as requested initially [11:34:17] https://www.wikidata.org/wiki/Special:ListGroupRights#wikidata-staff [11:34:32] syncing [11:35:17] added both patches to the calendar, but to the wrong day, damn, fixing [11:35:17] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202065 [[gerrit:454330|Switch 'wikidata-staff' add/remove 'interface-admin' for own account]] (duration: 00m 55s) [11:35:22] (PS2) Addshore: Require autoconfirmed status to edit the 828 namespace at es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/454590 (https://phabricator.wikimedia.org/T202555) (owner: MarcoAurelio) [11:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:23] T202065: Interface administrators in Wikidata - https://phabricator.wikimedia.org/T202065 [11:35:26] (CR) Addshore: [C: 2] Require autoconfirmed status to edit the 828 namespace at es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/454590 (https://phabricator.wikimedia.org/T202555) (owner: MarcoAurelio) [11:35:38] Hauskatze: thanks! [11:36:01] addshore: fixed & thanks to you :) [11:36:07] Operations, DBA, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui) [11:36:40] (Merged) jenkins-bot: Require autoconfirmed status to edit the 828 namespace at es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/454590 (https://phabricator.wikimedia.org/T202555) (owner: MarcoAurelio) [11:37:04] Hauskatze: that one is on mwdebug1002 [11:37:14] checking [11:37:59] and checked +1 lgtm [11:38:54] okay! 
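Annotation on the 11:32 to 11:34 exchange: the question there is whether members of 'wikidata-staff' may grant 'interface-admin' to anyone or only to their own account, and the answer given is "only to their own accounts". The distinction can be sketched as a toy model. The dictionary names loosely mirror MediaWiki's `$wgAddGroups` and `$wgGroupsAddToSelf` settings, but this is not MediaWiki code, the function `can_add_group` and the user names are hypothetical, and the actual contents of change 454330 are not shown in the log.

```python
# Toy model only. ADD_GROUPS stands in for "may grant to any user" and
# GROUPS_ADD_TO_SELF for "may grant to own account only"; both tables are
# assumptions made for illustration.
ADD_GROUPS = {}
GROUPS_ADD_TO_SELF = {"wikidata-staff": {"interface-admin"}}


def can_add_group(actor, actor_groups, target, group):
    """Return True if `actor` may add `group` to `target`."""
    for actor_group in actor_groups:
        # Unrestricted grant: applies to any target user.
        if group in ADD_GROUPS.get(actor_group, set()):
            return True
        # Self-only grant: applies only when acting on one's own account.
        if actor == target and group in GROUPS_ADD_TO_SELF.get(actor_group, set()):
            return True
    return False


if __name__ == "__main__":
    staff = {"wikidata-staff"}
    print(can_add_group("Alice", staff, "Alice", "interface-admin"))  # own account
    print(can_add_group("Alice", staff, "Bob", "interface-admin"))    # someone else
```

Under these assumed tables the first call is allowed and the second is refused, which is the behaviour confirmed on Special:ListGroupRights at 11:33:55.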
[11:39:45] syncing [11:40:36] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202555 [[gerrit:454590|Require autoconfirmed status to edit the 828 namespace at es.wikibooks]] (duration: 00m 56s) [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:41] T202555: Require autoconfirmed status to edit Module namespaces at es.wikibooks - https://phabricator.wikimedia.org/T202555 [11:41:05] Hauskatze: all done [11:41:18] addshore: great, thanks! [11:41:21] (PS5) Addshore: Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:41:25] (CR) Addshore: [C: 2] Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:42:51] (Merged) jenkins-bot: Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [11:44:34] (CR) jenkins-bot: Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: MarcoAurelio) [11:50:22] !log addshore@deploy1001 Synchronized wmf-config: T199800 [[gerrit:Cleanup wikdiff2 mobile moved paragraph config]] (duration: 00m 55s) [11:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:27] T199800: Enabling moved paragraphs everywhere and for mobile diffs - https://phabricator.wikimedia.org/T199800 [11:56:16] !log SWAT done [11:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1200) [12:00:36] !log 
marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105:3312 (duration: 00m 55s) [12:00:37] !log Deploy schema change on db1105:3312 [12:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:06] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (faidon) a:RobH>Muehlenhoff [12:05:28] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (MoritzMuehlenhoff) >>! In T199125#4524995, @RobH wrote: > Same thing in stretch. This is odd, since we must have installed other... [12:06:01] !log created new archiva-deployers LDAP group (T200454) [12:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:05] \o/ [12:10:02] (CR) jenkins-bot: Require autoconfirmed status to edit the 828 namespace at es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/454590 (https://phabricator.wikimedia.org/T202555) (owner: MarcoAurelio) [12:10:04] (CR) jenkins-bot: Cleanup wikdiff2 mobile moved paragraph config [mediawiki-config] - https://gerrit.wikimedia.org/r/454019 (https://phabricator.wikimedia.org/T199800) (owner: WMDE-Fisch) [12:10:16] (PS2) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/454796 [12:10:26] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/454796 (owner: Marostegui) [12:10:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/454796 (owner: Marostegui) [12:12:13] (CR) Muehlenhoff: [C: 2] Recognise archiva-deployers LDAP group in offboarding 
script [puppet] - https://gerrit.wikimedia.org/r/454797 (https://phabricator.wikimedia.org/T200454) (owner: Muehlenhoff) [12:24:32] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:22] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 74484 bytes in 0.308 second response time [12:31:49] (PS6) Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039) [12:34:45] !log install Java updates on Hadoop nodes [12:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:56] Operations, Commons, MediaWiki-File-management, MediaWiki-Maintenance-scripts, Multimedia: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (jcrespo) [12:45:59] (PS2) Volans: Add README [cookbooks] - https://gerrit.wikimedia.org/r/454559 (https://phabricator.wikimedia.org/T199079) [12:46:01] (PS1) Volans: Initial structure for the cookbooks hierarchy [cookbooks] - https://gerrit.wikimedia.org/r/454800 (https://phabricator.wikimedia.org/T199079) [12:46:03] Operations, media-storage: cleanupUploadStash.php / swift-codfw backend-fail-delete / cron spam - https://phabricator.wikimedia.org/T202584 (jcrespo) [12:46:22] (PS1) Volans: cookbook: fix BaseCookbooksItem interface [software/spicerack] - https://gerrit.wikimedia.org/r/454801 (https://phabricator.wikimedia.org/T199079) [12:46:24] (PS1) Volans: cookbook: fix links to parent in interactive menu [software/spicerack] - https://gerrit.wikimedia.org/r/454802 (https://phabricator.wikimedia.org/T199079) [12:46:26] (PS1) Volans: cookbook: properly handle KeyboardInterrupt [software/spicerack] - https://gerrit.wikimedia.org/r/454803 (https://phabricator.wikimedia.org/T199079) [12:46:28] (PS1) Volans: cookbook: 
allow to pass parameters in the menu [software/spicerack] - 10https://gerrit.wikimedia.org/r/454804 (https://phabricator.wikimedia.org/T199079) [12:46:30] (03PS1) 10Volans: cookbook: handle SystemExit exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/454805 (https://phabricator.wikimedia.org/T199079) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1300) [13:03:25] marostegui: https://ask.openstack.org/en/question/99925/mitaka-nova-error-too-many-connection-mysql/ LOL cc bblack [13:03:29] cc bstorm_ ** [13:03:53] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:04:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:05:12] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [13:06:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:06:19] !log upgrading dbstore200[12], will take a while due to ongoing alter tables [13:06:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:07:03] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:10:23] 10Operations: Reboots of dumps/snapshot hosts for L1TF/microcode updates - https://phabricator.wikimedia.org/T202623 (10MoritzMuehlenhoff) [13:11:45] 10Operations, 10Dumps-Generation: Reboots of dumps/snapshot hosts for L1TF/microcode updates - https://phabricator.wikimedia.org/T202623 (10ArielGlenn) [13:13:42] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [13:14:43] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:14:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:15:40] jouncebot: now [13:15:45] For the next 1 hour(s) and 44 minute(s): MediaWiki train - European version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1300) [13:15:46] jouncebot: next [13:15:46] In 2 hour(s) and 44 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1600) [13:19:57] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454814 [13:21:43] (03PS2) 10Addshore: Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454785 (https://phabricator.wikimedia.org/T198396) (owner: 10Jonas Kress (WMDE)) [13:22:07] marostegui: are you going to deploy that db related one? :) [13:24:25] (03PS3) 10Jonas Kress (WMDE): Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454785 (https://phabricator.wikimedia.org/T198396) [13:25:12] (03CR) 10Addshore: [C: 032] Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454785 (https://phabricator.wikimedia.org/T198396) (owner: 10Jonas Kress (WMDE)) [13:26:29] (03Merged) 10jenkins-bot: Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454785 (https://phabricator.wikimedia.org/T198396) (owner: 10Jonas Kress (WMDE)) [13:28:19] addshore: nope, go ahead, I can wait for the train to be finished :) [13:28:46] marostegui: this isn't train, just sneaking one out that I missed in my swat slot! [13:28:57] figured i'd steal some train slot time as the train is in US mode :) [13:28:57] Ah! 
:) [13:29:10] Then I will in a few minutes and once you are done [13:29:27] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T198396 Revert "Limit page creation and edit rate on Wikidata" [[gerrit:454785]] (duration: 00m 56s) [13:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:32] T198396: Remove ratelimit from Wikidata for some groups again - https://phabricator.wikimedia.org/T198396 [13:29:37] marostegui: thats me all done! [13:29:43] that was fast! :) [13:29:51] ;) [13:30:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454814 (owner: 10Marostegui) [13:30:16] (03CR) 10ArielGlenn: "There seem to be several trusty hosts with python-conftool installed on them yet: labcontrol[1001-1002].wikimedia.org,labtestcontrol2001.w" [puppet] - 10https://gerrit.wikimedia.org/r/454592 (owner: 10Dzahn) [13:31:21] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454814 (owner: 10Marostegui) [13:33:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3312 (duration: 00m 54s) [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454818 [13:33:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2011.codfw.wmnet', 'cp4029.ulsfo.wmnet'] ``` The log can be found in `/var/l... 
[13:35:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454818 (owner: 10Marostegui) [13:36:08] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/454592 (owner: 10Dzahn) [13:36:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454818 (owner: 10Marostegui) [13:37:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 (duration: 00m 54s) [13:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:10] !log Deploy schema change on db1103:3312 [13:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:28] !log upgrading dbstore1001 [13:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:58] (03CR) 10jenkins-bot: Revert "Limit page creation and edit rate on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454785 (https://phabricator.wikimedia.org/T198396) (owner: 10Jonas Kress (WMDE)) [13:40:00] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454814 (owner: 10Marostegui) [13:40:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454818 (owner: 10Marostegui) [13:40:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) After one day: ``` elukey@stat1005:~$ grep https ipv6_after_changes.log| while read line; do endpoint=$(echo $line | cut -d" "... 
[13:47:10] (03PS6) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [13:48:41] (03CR) 10Gehel: [C: 031] "trivial enough at this point!" [cookbooks] - 10https://gerrit.wikimedia.org/r/454800 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:54:51] (03CR) 10Gehel: [C: 031] "Minor comment inline, otherwise LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/454559 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:58:12] (03CR) 10Gehel: cookbook: fix links to parent in interactive menu (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/454802 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:59:04] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/454803 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:00:39] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4029.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4029.ulsfo.wmnet'] ``` [14:03:17] (03CR) 10Gehel: cookbook: handle SystemExit exceptions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/454805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:06:05] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) It appears that a sitemap for the mobile site isn't going to be he... [14:10:42] (03CR) 10Ottomata: [C: 04-1] "Naw, needs to be public. This is a read-only rsync daemon module. 
It uses git-fat to pull down large artifact files to local working cop" [puppet] - 10https://gerrit.wikimedia.org/r/454770 (owner: 10Elukey) [14:11:53] vgutierrez: --^ [14:11:59] (03Abandoned) 10Elukey: profile::archiva: limit rsync access to DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/454770 (owner: 10Elukey) [14:14:10] (03PS3) 10Gehel: Remove support for elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [14:15:11] 10Operations, 10ops-esams, 10Traffic: cp3036 PS Redundancy Lost - https://phabricator.wikimedia.org/T202627 (10ema) [14:15:14] 10Operations, 10ops-esams, 10Traffic: cp3036 PS Redundancy Lost - https://phabricator.wikimedia.org/T202627 (10ema) p:05Triage>03Normal [14:15:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema https://phabricator.wikimedia.org/T202627 [14:16:28] (03PS7) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [14:20:03] (03PS1) 10Jcrespo: mariadb: depool db1113 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454823 [14:20:05] (03CR) 10Gehel: "confirmed this is a noop: https://puppet-compiler.wmflabs.org/compiler02/12195/" [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [14:20:08] (03CR) 10Gehel: [C: 032] Remove support for elasticsearch 2.x [puppet] - 10https://gerrit.wikimedia.org/r/447564 (owner: 10EBernhardson) [14:21:02] !log otto@deploy1001 Started deploy [analytics/turnilo/deploy@240b357]: Deploying 1.7.2 to analytics-tool1002 [14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] (03CR) 10Muehlenhoff: "Let's add a comment to the ferm::service, though. This isn't really obvious otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/454770 (owner: 10Elukey) [14:26:09] !log otto@deploy1001 Finished deploy [analytics/turnilo/deploy@240b357]: Deploying 1.7.2 to analytics-tool1002 (duration: 05m 07s) [14:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:36] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Outside of the scope of this ticket, but I wanted to note it. Thi... [14:27:44] (03PS1) 10Elukey: profile::archiva: add comment about rsync's firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/454824 [14:28:07] (03PS2) 10Elukey: profile::archiva: add comment about rsync's firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/454824 [14:28:09] moritzm: --^ done [14:29:26] (03CR) 10Elukey: [C: 032] profile::archiva: add comment about rsync's firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/454824 (owner: 10Elukey) [14:29:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/454824 (owner: 10Elukey) [14:31:58] (03PS1) 10Andrew Bogott: region-migrate: update for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/454825 (https://phabricator.wikimedia.org/T191790) [14:33:10] (03CR) 10Jcrespo: [C: 032] mariadb: depool db1113 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454823 (owner: 10Jcrespo) [14:34:26] (03Merged) 10jenkins-bot: mariadb: depool db1113 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454823 (owner: 10Jcrespo) [14:35:04] (03CR) 10jenkins-bot: mariadb: depool db1113 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454823 (owner: 10Jcrespo) [14:36:09] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1113 (duration: 00m 57s) [14:36:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] (03PS8) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [14:37:42] !log otto@deploy1001 Started deploy [analytics/turnilo/deploy@50a8845]: Deploying 1.7.2 to analytics-tool1002 with dep fixes [14:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:50] !log otto@deploy1001 Finished deploy [analytics/turnilo/deploy@50a8845]: Deploying 1.7.2 to analytics-tool1002 with dep fixes (duration: 00m 08s) [14:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:03] (03CR) 10Andrew Bogott: [C: 032] region-migrate: update for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/454825 (https://phabricator.wikimedia.org/T191790) (owner: 10Andrew Bogott) [14:42:15] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10kchapman) Since there was broad agreement at the RFC meeting and hasn't been any objection raised since, TechCom has approved this. [14:42:33] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10kchapman) [14:45:30] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10Marostegui) 05declined>03Open a:03Papaul I have been talking to @Papaul and we can re-use db2064's BBU (T195228) to replace db2033's (T184888) Given the fact that 1) This host is really scheduled for decomm... 
[14:47:24] !log upgrading db1113 [14:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:37] (03PS9) 10Gehel: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:49:15] (03PS1) 10Andrew Bogott: region-migrate: fix network_name for eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/454828 (https://phabricator.wikimedia.org/T191790) [14:50:54] (03PS1) 10Jcrespo: Revert "mariadb: depool db1113 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454829 [14:51:42] (03PS1) 10Bstorm: nova: restrict worker numbers further [puppet] - 10https://gerrit.wikimedia.org/r/454830 (https://phabricator.wikimedia.org/T188589) [14:52:17] (03CR) 10Andrew Bogott: [C: 032] region-migrate: fix network_name for eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/454828 (https://phabricator.wikimedia.org/T191790) (owner: 10Andrew Bogott) [14:52:26] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10Marostegui) Interesting, the BBU now doesn't look broken anymore. Maybe another case of BBUs recovering after a reboot: ``` root@db2033:~# hpssacli controller all show status Smart Array P420i in Slot 0 (Embedde... 
[14:53:15] (03CR) 10Gehel: [C: 032] "ppc agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler02/12197/" [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:53:26] (03PS10) 10Gehel: Make elasticsearch http and transport ports explicit [puppet] - 10https://gerrit.wikimedia.org/r/447568 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:55:39] (03PS37) 10Gehel: Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:56:22] (03CR) 10jerkins-bot: [V: 04-1] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:56:43] (03CR) 10Vogone: [C: 04-1] "Pending community discussion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454762 (https://phabricator.wikimedia.org/T202597) (owner: 10Vogone) [14:57:14] ah there he is, welcome onimisionipe! I didn't see you come in [14:57:51] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10Papaul) a:05Papaul>03Marostegui complete [14:58:25] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: depool db1113 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454829 (owner: 10Jcrespo) [14:58:27] (03Abandoned) 10Bstorm: nova: restrict worker numbers further [puppet] - 10https://gerrit.wikimedia.org/r/454830 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [14:58:38] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10Marostegui) Thanks! 
``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ``` [14:59:11] (03PS1) 10Ottomata: Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) [14:59:42] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1113 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454829 (owner: 10Jcrespo) [15:00:01] (03CR) 10jerkins-bot: [V: 04-1] Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:01:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454832 [15:02:02] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1113 (duration: 00m 54s) [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:55] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: (null) [15:03:24] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: (null) [15:03:25] PROBLEM - ElasticSearch health check for shards on logstash1008 is CRITICAL: (null) [15:03:29] godog: ^ [15:03:35] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: (null) [15:03:35] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: (null) [15:03:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) >>! In T199125#4526460, @MoritzMuehlenhoff wrote: >>>! In T199125#4524995, @RobH wrote: >> Same thing in stretch. This is o... [15:04:04] Oh... that's probably my last patch, checking [15:04:49] gehel: thanks! 
I'm in a meeting but LMK if it is actually logstash [15:05:17] most probably just a change to the check itself, I'm on it [15:05:49] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454832 (owner: 10Marostegui) [15:07:04] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454832 (owner: 10Marostegui) [15:08:10] yep, shards are all ok, problem is with the check itself. My last merge was a noop according to puppet compiler, but I have probably missed something [15:08:12] (03PS2) 10Ottomata: Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) [15:08:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 (duration: 00m 55s) [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:31] Ofc, no changes on the elastic nodes, but there was a change on the icinga side that I did not check, my bad [15:08:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454833 [15:08:59] (03CR) 10jerkins-bot: [V: 04-1] Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:12:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454833 (owner: 10Marostegui) [15:12:37] ACKNOWLEDGEMENT - HP RAID on db2033 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T202635 [15:12:42] 10Operations, 10ops-codfw: Degraded RAID on db2033 - 
https://phabricator.wikimedia.org/T202635 (10ops-monitoring-bot) [15:13:10] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T202635 (10Marostegui) [15:13:12] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10Marostegui) [15:13:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454833 (owner: 10Marostegui) [15:14:23] !log upgrading db2089 [15:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 (duration: 00m 54s) [15:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:01] !log Deploy schema change on db1090:3312 [15:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:47] godog: shard check is coming back together. The specific port to check was added and there was transient incoherence before einsteinium was updated. All good now. [15:19:21] (03PS3) 10Ottomata: Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) [15:20:00] gehel: sweet! thanks for the update [15:22:05] PROBLEM - ElasticSearch health check for shards on logstash1009 is CRITICAL: (null) [15:22:23] (03PS4) 10Ottomata: Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) [15:24:27] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/12200/analytics-tool1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:24:41] elukey: when you have a min ^ [15:29:04] (03CR) 10Elukey: [C: 031] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:31:38] (03PS2) 10RobH: adding user Tim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/454150 (https://phabricator.wikimedia.org/T202063) [15:32:22] (03CR) 10RobH: [C: 032] adding user Tim WMDE [puppet] - 10https://gerrit.wikimedia.org/r/454150 (https://phabricator.wikimedia.org/T202063) (owner: 10RobH) [15:32:24] (03CR) 10Ottomata: [C: 032] Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:33:45] 10Operations, 10ops-codfw, 10Services (watching): restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 (10Eevans) From the SRE/CP Scalability meeting: We should attempt to update the RAID firmware, and ensure that the host is capable of rebooting successfully without interve... [15:34:11] (03PS5) 10Ottomata: Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) [15:34:17] (03CR) 10Ottomata: [V: 032 C: 032] Create auth ldap proxy for turnilo on analytics-tool1002 [puppet] - 10https://gerrit.wikimedia.org/r/454831 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [15:38:28] PROBLEM - Check systemd state on dbstore1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:39:24] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) Looking closer at the nova connections right now, they are all for two older servers. That said, we have 8 workers and one master... [15:40:07] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:40:57] (03PS2) 10RobH: adding tieu to groups [puppet] - 10https://gerrit.wikimedia.org/r/454153 (https://phabricator.wikimedia.org/T202063) [15:41:19] (03CR) 10RobH: [C: 032] adding tieu to groups [puppet] - 10https://gerrit.wikimedia.org/r/454153 (https://phabricator.wikimedia.org/T202063) (owner: 10RobH) [15:42:48] (03CR) 10jenkins-bot: Revert "mariadb: depool db1113 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454829 (owner: 10Jcrespo) [15:42:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454832 (owner: 10Marostegui) [15:42:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454833 (owner: 10Marostegui) [15:45:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10RobH) [15:45:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to view EventLogging data for Tim WMDE - https://phabricator.wikimedia.org/T202063 (10RobH) 05Open>03Resolved a:03RobH No objections noted, and access has been merged live. Please note it can take up to 30 m... 
[15:47:19] (03PS1) 10Bstorm: nova: reduce the pool size for database connections a lot [puppet] - 10https://gerrit.wikimedia.org/r/454843 (https://phabricator.wikimedia.org/T188589) [15:49:28] (03PS2) 10RobH: adding user tonia to admin module [puppet] - 10https://gerrit.wikimedia.org/r/454327 (https://phabricator.wikimedia.org/T202069) [15:51:04] (03CR) 10RobH: [C: 032] adding user tonia to admin module [puppet] - 10https://gerrit.wikimedia.org/r/454327 (https://phabricator.wikimedia.org/T202069) (owner: 10RobH) [15:52:27] (03PS2) 10RobH: adding tonina to groups in admin module [puppet] - 10https://gerrit.wikimedia.org/r/454328 (https://phabricator.wikimedia.org/T202069) [15:52:52] granting access to all the things [15:53:21] (03CR) 10RobH: [C: 032] adding tonina to groups in admin module [puppet] - 10https://gerrit.wikimedia.org/r/454328 (https://phabricator.wikimedia.org/T202069) (owner: 10RobH) [15:53:58] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RobH) [15:54:24] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RobH) 05Open>03Resolved a:03RobH No objections have been noted and all other criteria met, so this is now merged live. Please note it can tak... [15:56:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (10RobH) [15:59:45] goddamn it [15:59:49] i got to update the L3 and get 'Only admins may require signature.' [15:59:53] when editing the body... [16:00:05] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1600). 
[16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:11] (03PS1) 10Vgutierrez: [WIP] Implement DNS01 challenge support [software/certcentral] - 10https://gerrit.wikimedia.org/r/454845 [16:00:15] i wonder if there is any drawback to tossing my user in the admin group to avoid this crap [16:00:35] hrmm, says my tam can edit but [16:01:02] 10Operations, 10ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T201757 (10jcrespo) Please upgrade kernel and mariadb server version on reboot! Thanks. [16:01:16] (03CR) 10Arturo Borrero Gonzalez: [C: 031] nova: reduce the pool size for database connections a lot [puppet] - 10https://gerrit.wikimedia.org/r/454843 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:01:23] (03PS2) 10Arturo Borrero Gonzalez: nova: reduce the pool size for database connections a lot [puppet] - 10https://gerrit.wikimedia.org/r/454843 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:01:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Implement DNS01 challenge support [software/certcentral] - 10https://gerrit.wikimedia.org/r/454845 (owner: 10Vgutierrez) [16:09:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:11:28] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:13:08] (03CR) 10Bstorm: [C: 032] nova: reduce the pool size for database connections a lot [puppet] - 10https://gerrit.wikimedia.org/r/454843 (https://phabricator.wikimedia.org/T188589) (owner: 10Bstorm) [16:13:29] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10aborrero) >>! In T188589#4527126, @Bstorm wrote: > Looking closer at the nova connections right now, they are all for two older servers. T... [16:13:38] 10Operations, 10Legalpad: Update terms "Labs" and "Operations" in L3 - https://phabricator.wikimedia.org/T202617 (10RobH) 05Open>03Resolved a:03RobH All done, added irc options as well as clarified that real time is preferred over email. Added in the security@ alias in case IRC is not an option (as it h... [16:13:52] (03PS1) 10Bstorm: Revert "nova: reduce the pool size for database connections a lot" [puppet] - 10https://gerrit.wikimedia.org/r/454850 [16:15:03] !log upgrade hp raid firmware on restbase2003 - T201804 T141756 [16:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:10] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [16:15:10] T201804: restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 [16:17:09] !log deactivating IX/Transit links on cr1-eqdfw - T196941 [16:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:14] T196941: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 [16:18:20] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (10RobH) Ok, I've gone ahead and updated the L3 per the instructions on T202617. 
So that should eliminate the earlier... [16:18:47] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:07] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:44] !log filippo@neodymium conftool action : set/pooled=no; selector: name=restbase2003.codfw.wmnet [16:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:00] !log depool and cassandra-drain restbase2003, then reboot [16:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:10] (03PS1) 10Bmansurov: Enable logging for Schema:CitationUsage at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454854 (https://phabricator.wikimedia.org/T191086) [16:24:08] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [16:24:17] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:25:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 44 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:26:18] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused [16:26:26] 10Operations, 10ops-eqdfw, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [16:26:57] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:28:57] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.136 and port 9042: Connection refused [16:29:37] PROBLEM - cassandra-c SSL 10.192.32.136:7001 on 
restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:30:14] oops, that's me [16:30:16] downtiming [16:30:19] (03PS1) 10Filippo Giunchedi: graphite: double quotes for carbonate.conf content [puppet] - 10https://gerrit.wikimedia.org/r/454855 [16:30:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:33:14] (03CR) 10Filippo Giunchedi: [C: 032] graphite: double quotes for carbonate.conf content [puppet] - 10https://gerrit.wikimedia.org/r/454855 (owner: 10Filippo Giunchedi) [16:39:14] (03PS4) 10Dzahn: quarry::database: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454711 [16:41:35] (03CR) 10Zhuyifei1999: [C: 031] quarry::web: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454715 (owner: 10Dzahn) [16:41:39] (03CR) 10Zhuyifei1999: [C: 031] quarry::database: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454711 (owner: 10Dzahn) [16:46:47] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2020-06-24 13:01:33 +0000 (expires in 670 days) [16:46:48] RECOVERY - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-c valid until 2020-06-24 13:01:33 +0000 (expires in 670 days) [16:46:48] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2020-06-24 13:01:32 +0000 (expires in 670 days) [16:46:58] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042 [16:46:58] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 1.071 second response time on 10.192.32.134 port 9042 [16:47:27] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.038 second response time on 10.192.32.136 port 9042 
[16:49:04] !log filippo@neodymium conftool action : set/pooled=yes; selector: name=restbase2003.codfw.wmnet [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:39] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Jhernandez) [16:50:36] 10Operations, 10ops-codfw, 10Services (watching): restbase2003 has a broken disk (at least) - https://phabricator.wikimedia.org/T201804 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Firmware upgrade and reboot done, the reboot completed unattended. I'm tentatively resolving, we'll reopen if it occurs... [16:53:01] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [16:53:28] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [16:54:09] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Jhernandez) [16:56:12] (03CR) 10Dzahn: [C: 032] quarry::database: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454711 (owner: 10Dzahn) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1700). 
[17:00:37] (03CR) 10Filippo Giunchedi: [C: 031] mariadb backups: Capture connection error exceptions [puppet] - 10https://gerrit.wikimedia.org/r/454509 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [17:01:52] !log powering off cr1-eqdfw - T196941 [17:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:57] T196941: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 [17:02:37] 10Operations, 10ops-eqdfw, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10Papaul) [17:06:46] (03PS4) 10Dzahn: quarry::web: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454715 [17:08:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10RobH) [17:10:19] the moon is old hat. mars is where it's at any more [17:14:23] (03CR) 10Muehlenhoff: "That seems fine approach-wise, but we should also verify the status for WMCS and analytics mysqls, adding some additional for additional c" [puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [17:15:27] 10Operations, 10ops-eqdfw, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [17:16:07] (03CR) 10Dzahn: [C: 032] quarry::web: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454715 (owner: 10Dzahn) [17:16:41] !log enable v6 neighbors on cr2-eqdfw - T196941 [17:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:47] T196941: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 [17:25:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10RobH) a:05RobH>03Cmjohnson Ok, I've synced up with @Cmjohnson and have the next 
steps to bring these online: [... [17:28:22] !log enable v4 neighbors on cr2-eqdfw - T196941 [17:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:28] T196941: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 [17:30:04] (03PS1) 10Dzahn: quarry::web: replace class declaration with require for base [puppet] - 10https://gerrit.wikimedia.org/r/454862 [17:30:44] (03CR) 10jerkins-bot: [V: 04-1] quarry::web: replace class declaration with require for base [puppet] - 10https://gerrit.wikimedia.org/r/454862 (owner: 10Dzahn) [17:30:53] (03PS2) 10Dzahn: quarry::web: replace class declaration with require for base [puppet] - 10https://gerrit.wikimedia.org/r/454862 [17:31:28] (03CR) 10jerkins-bot: [V: 04-1] quarry::web: replace class declaration with require for base [puppet] - 10https://gerrit.wikimedia.org/r/454862 (owner: 10Dzahn) [17:31:47] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10Imarlier) [17:32:18] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: Job definition for trafficserver_exporter [puppet] - 10https://gerrit.wikimedia.org/r/454784 (https://phabricator.wikimedia.org/T202381) (owner: 10Ema) [17:33:02] 10Operations, 10SRE-Access-Requests: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (10Imarlier) [17:37:17] (03CR) 10Legoktm: "Thanks for taking care of this :)" [puppet] - 10https://gerrit.wikimedia.org/r/454701 (https://phabricator.wikimedia.org/T202473) (owner: 10Dzahn) [17:39:41] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10Legoktm) >>! In T202475#4525265, @RobH wrote: > Once we have that info/NDA on file for you, we should be able to mo... [17:39:52] (03CR) 10Dzahn: "gotta override jenkins vote on this one because it's a chicken-egg issue. 
will follow-up shortly fixing that issue in another change" [puppet] - 10https://gerrit.wikimedia.org/r/454862 (owner: 10Dzahn) [17:40:08] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0 [17:40:09] (03CR) 10Dzahn: [V: 032 C: 032] quarry::web: replace class declaration with require for base [puppet] - 10https://gerrit.wikimedia.org/r/454862 (owner: 10Dzahn) [17:41:00] what is that now [17:42:27] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [17:42:44] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10PEarleyWMF) Hey all, Thanks for moving forward on this request. To address the concerns above from @RobH, this is me posting... [17:43:22] 10Operations, 10ops-eqdfw, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [17:43:23] good [17:47:53] (03PS1) 10Ottomata: Include superset on analytics-tool1003 [puppet] - 10https://gerrit.wikimedia.org/r/454863 (https://phabricator.wikimedia.org/T202011) [17:50:15] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/12202/analytics-tool1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454863 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [17:50:17] (03CR) 10Ottomata: [C: 032] Include superset on analytics-tool1003 [puppet] - 10https://gerrit.wikimedia.org/r/454863 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [17:52:19] (03PS1) 10Ottomata: Include base firewall and standard in role::analytics_cluster::superset [puppet] - 10https://gerrit.wikimedia.org/r/454864 [17:53:08] (03CR) 10Ottomata: [C: 032] Include base firewall and standard in role::analytics_cluster::superset [puppet] - 
10https://gerrit.wikimedia.org/r/454864 (owner: 10Ottomata) [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:06] (03PS1) 10Ayounsi: Puppet: rename cr1-eqdfw to cr2-eqdfw [puppet] - 10https://gerrit.wikimedia.org/r/454865 (https://phabricator.wikimedia.org/T196941) [18:01:15] PROBLEM - puppet last run on analytics-tool1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[analytics/superset/deploy] [18:04:32] (03PS1) 10Ayounsi: DNS: rename cr1-eqdfw to cr2-eqdfw [dns] - 10https://gerrit.wikimedia.org/r/454867 (https://phabricator.wikimedia.org/T196941) [18:04:37] (03CR) 10Ayounsi: [C: 032] Puppet: rename cr1-eqdfw to cr2-eqdfw [puppet] - 10https://gerrit.wikimedia.org/r/454865 (https://phabricator.wikimedia.org/T196941) (owner: 10Ayounsi) [18:05:01] (03CR) 10Ayounsi: [C: 032] DNS: rename cr1-eqdfw to cr2-eqdfw [dns] - 10https://gerrit.wikimedia.org/r/454867 (https://phabricator.wikimedia.org/T196941) (owner: 10Ayounsi) [18:05:55] PROBLEM - superset on analytics-tool1003 is CRITICAL: connect to address 10.64.36.112 and port 9080: Connection refused [18:06:15] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [18:08:52] !log otto@deploy1001 Started deploy [analytics/superset/deploy@de75f23]: 0.26.3 to analytics-tool1003 [18:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:27] !log otto@deploy1001 Finished deploy [analytics/superset/deploy@de75f23]: 0.26.3 to analytics-tool1003 (duration: 00m 36s) [18:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:00] RECOVERY - superset on analytics-tool1003 is OK: TCP OK - 0.001 second response time on 10.64.36.112 port 9080 [18:13:54] (03PS1) 10Ottomata: Install libmariadbclient18 for mysql databse for superset in stretch [puppet] - 10https://gerrit.wikimedia.org/r/454869 (https://phabricator.wikimedia.org/T202011) [18:14:52] (03PS2) 10Ottomata: Install libmariadbclient18 for mysql databse for superset in stretch [puppet] - 10https://gerrit.wikimedia.org/r/454869 (https://phabricator.wikimedia.org/T202011) [18:14:57] (03CR) 10Ottomata: [V: 032 C: 032] Install libmariadbclient18 for mysql databse for superset in stretch [puppet] - 10https://gerrit.wikimedia.org/r/454869 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [18:15:18] !log otto@deploy1001 Started deploy [analytics/superset/deploy@de75f23]: 0.26.3 to analytics-tool1003 with requirements setup [18:15:21] !log otto@deploy1001 Finished deploy [analytics/superset/deploy@de75f23]: 0.26.3 to analytics-tool1003 with requirements setup (duration: 00m 03s) [18:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:10] RECOVERY - puppet last run on analytics-tool1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:16:38] (03PS1) 10Dzahn: quarry::base: convert to a profile [puppet] - 
10https://gerrit.wikimedia.org/r/454870 [18:17:24] (03CR) 10jerkins-bot: [V: 04-1] quarry::base: convert to a profile [puppet] - 10https://gerrit.wikimedia.org/r/454870 (owner: 10Dzahn) [18:17:25] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10RobH) >>! In T202475#4527572, @Legoktm wrote: >>>! In T202475#4525265, @RobH wrote: >> Once we have that info/NDA o... [18:19:07] (03PS1) 10Filippo Giunchedi: Shift carbon/statsd write traffic to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/454872 (https://phabricator.wikimedia.org/T196484) [18:19:16] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10RobH) [18:23:14] (03PS1) 10RobH: adding Christoph Fischer to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/454873 (https://phabricator.wikimedia.org/T202475) [18:24:04] (03PS1) 10Filippo Giunchedi: diamond: send metrics to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/454874 (https://phabricator.wikimedia.org/T196484) [18:24:06] (03PS1) 10Filippo Giunchedi: graphite: add graphite1004 to cluster_servers [puppet] - 10https://gerrit.wikimedia.org/r/454875 (https://phabricator.wikimedia.org/T196484) [18:24:10] (03PS1) 10Filippo Giunchedi: varnish: move to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/454876 (https://phabricator.wikimedia.org/T196484) [18:24:12] (03PS1) 10Filippo Giunchedi: graphite: move alerting to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/454877 (https://phabricator.wikimedia.org/T196484) [18:24:14] (03PS1) 10Filippo Giunchedi: calico: allow statsd traffic to graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/454878 (https://phabricator.wikimedia.org/T196484) [18:24:47] (03PS1) 10RobH: adding Christoph Jauera to 
releasers-wikidiff2 [puppet] - 10https://gerrit.wikimedia.org/r/454879 (https://phabricator.wikimedia.org/T202475) [18:25:10] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10RobH) [18:25:25] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give WMDE-Fisch permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202475 (10RobH) a:05WMDE-Fisch>03None [18:26:31] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage [18:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:30] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) a:05PEarleyWMF>03RobH Ok, everything looks good on this. I'll go ahead and get the patchsets going. [18:27:59] the past 2 weeks have had quite a number of shell requests, heh. 
[18:28:22] !log Deployed patch for T151910 [18:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:38] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.18/extensions/FlaggedRevs/backend/FlaggedRevs.class.php: [[gerrit:454868|Fix incorrect return value in CurrentRevisionCallback]] T202580 (duration: 00m 54s) [18:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:43] T202580: BadMethodCallException from line 3671 of /srv/mediawiki/php-1.32.0-wmf.18/includes/parser/Parser.php: Call to a member function getId() on a non-object (boolean) - https://phabricator.wikimedia.org/T202580 [18:30:04] !log remove MAC filter workaround on cr2-eqdfw - T196941 [18:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:10] T196941: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 [18:30:25] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage (duration: 03m 54s) [18:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:38] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 2, onthisday is failing [18:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:51] 10Operations, 10ops-eqdfw, 10netops, 10Patch-For-Review: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [18:32:14] RoanKattouw: could you ping me when you're done with updates and I'll roll-forward train to try to get caught up? 
[18:32:20] !log Deployed patch for T199993 [18:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:24] thcipriani: Sorry done now [18:32:37] no worries, just didn't want to step on your toes :) [18:32:41] Nor I on yours [18:32:47] thank for the ping :) [18:33:08] !log starting train a little early to play catchup [18:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:31] !log Equinix updated their MAC filter, IX sessions up - T196941 [18:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:04] (03PS1) 10Thcipriani: group1 wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454880 [18:34:06] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454880 (owner: 10Thcipriani) [18:35:30] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454880 (owner: 10Thcipriani) [18:36:13] (03PS1) 10RobH: adding Patrick Earley to shell users [puppet] - 10https://gerrit.wikimedia.org/r/454882 (https://phabricator.wikimedia.org/T201667) [18:37:00] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.18 [18:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:08] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 2, onthisday is failing (duration: 06m 30s) [18:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:20] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 3, onthisday is failing [18:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] !log thcipriani@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.18 (duration: 00m 54s) [18:37:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:50] (03PS1) 10RobH: adding Patrick Earley to groups [puppet] - 10https://gerrit.wikimedia.org/r/454885 (https://phabricator.wikimedia.org/T201667) [18:39:04] (03CR) 10RobH: [C: 032] adding Patrick Earley to shell users [puppet] - 10https://gerrit.wikimedia.org/r/454882 (https://phabricator.wikimedia.org/T201667) (owner: 10RobH) [18:39:39] ottomata: i have your change [18:39:41] merging on puppetmatere [18:39:44] puppetmaster even [18:39:54] ottomata: Ottomata: Install libmariadbclient18 for mysql databse for superset in stretch (872b9a0457) [18:39:57] merged =] [18:40:25] (03CR) 10RobH: [C: 032] adding Patrick Earley to groups [puppet] - 10https://gerrit.wikimedia.org/r/454885 (https://phabricator.wikimedia.org/T201667) (owner: 10RobH) [18:41:14] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 3, onthisday is failing (duration: 03m 53s) [18:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:20] (03PS2) 10Dzahn: quarry::base: convert to a profile [puppet] - 10https://gerrit.wikimedia.org/r/454870 [18:42:00] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) [18:43:31] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) 05stalled>03Resolved Ok, all access has been merged live (since this request is weeks old, it was well over the 3 bu... [18:44:49] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 3, onthisday is timing out. Just push this through.. 
[18:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:25] 10Operations, 10SRE-Access-Requests: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (10RobH) p:05Triage>03Normal [18:48:21] (03CR) 10Dzahn: [C: 032] "18:42:06 Resolved violations:" [puppet] - 10https://gerrit.wikimedia.org/r/454870 (owner: 10Dzahn) [18:48:35] (03PS3) 10Dzahn: quarry::base: convert to a profile [puppet] - 10https://gerrit.wikimedia.org/r/454870 [18:48:40] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454880 (owner: 10Thcipriani) [18:48:48] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 3, onthisday is timing out. Just push this through.. (duration: 03m 59s) [18:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:08] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 4, onthisday is timing out. Just push this through.. [18:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:07] (03PS4) 10Dzahn: quarry::base: convert to a profile [puppet] - 10https://gerrit.wikimedia.org/r/454870 [18:51:29] (03CR) 10Dzahn: [C: 032] quarry::base: convert to a profile [puppet] - 10https://gerrit.wikimedia.org/r/454870 (owner: 10Dzahn) [18:54:30] (03PS1) 10RobH: add aaron to perf-team and perf-roots groups [puppet] - 10https://gerrit.wikimedia.org/r/454887 (https://phabricator.wikimedia.org/T202650) [18:54:47] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 4, onthisday is timing out. Just push this through.. 
(duration: 04m 39s) [18:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (10RobH) [18:56:12] (03PS1) 10Dzahn: quarry: update includes of converted base class [puppet] - 10https://gerrit.wikimedia.org/r/454889 [18:56:42] (03CR) 10jerkins-bot: [V: 04-1] quarry: update includes of converted base class [puppet] - 10https://gerrit.wikimedia.org/r/454889 (owner: 10Dzahn) [18:57:22] (03PS2) 10Dzahn: quarry: update includes of converted base class [puppet] - 10https://gerrit.wikimedia.org/r/454889 [18:57:58] (03CR) 10jerkins-bot: [V: 04-1] quarry: update includes of converted base class [puppet] - 10https://gerrit.wikimedia.org/r/454889 (owner: 10Dzahn) [18:59:43] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10RobH) At this time kinkle and gilles are already in that group: ``` perf-roots: gid: 766 description: users who have root on memcached, varnish, application... [19:00:04] thcipriani: That opportune time is upon us again. Time for a MediaWiki train - Americas version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T1900). 
[19:00:48] 10Operations, 10SRE-Access-Requests: request to add imarlier to perf-roots - https://phabricator.wikimedia.org/T202657 (10RobH) p:05Triage>03Normal [19:00:49] * thcipriani feels the excitement [19:00:53] 10Operations, 10SRE-Access-Requests: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10RobH) p:05Triage>03Normal [19:01:13] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10Imarlier) That's great, thanks @RobH [19:01:15] 10Operations, 10SRE-Access-Requests: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10RobH) [19:01:43] (03PS2) 10Dzahn: quarry::redis: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454717 [19:01:54] 10Operations, 10SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (10RobH) 05Open>03declined [19:03:32] (03CR) 10Dzahn: [C: 032] quarry::redis: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/454717 (owner: 10Dzahn) [19:05:33] 10Operations, 10SRE-Access-Requests: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10RobH) a:03Imarlier @Imarlier: Apologies, but I'm uncertain who phendeskog is, can you please provide the following: Their real name, shell name, & if they already have shell access?... 
[19:06:30] 10Operations, 10SRE-Access-Requests: request to add imarlier to perf-roots - https://phabricator.wikimedia.org/T202657 (10RobH) [19:06:59] (03PS3) 10Dzahn: quarry: convert remaining module to profile, update includes [puppet] - 10https://gerrit.wikimedia.org/r/454889 [19:10:33] 10Operations, 10SRE-Access-Requests: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10RobH) a:05Imarlier>03None [19:12:02] oh robh sorry, thanks [19:13:27] 10Operations, 10SRE-Access-Requests: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10RobH) [19:15:16] (03PS1) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:15:59] (03CR) 10jerkins-bot: [V: 04-1] Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:16:51] (03PS2) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:16:53] (03CR) 10Dzahn: [C: 032] quarry: convert remaining module to profile, update includes [puppet] - 10https://gerrit.wikimedia.org/r/454889 (owner: 10Dzahn) [19:16:59] (03PS4) 10Dzahn: quarry: convert remaining module to profile, update includes [puppet] - 10https://gerrit.wikimedia.org/r/454889 [19:17:36] (03CR) 10jerkins-bot: [V: 04-1] Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:20:06] no worries! [19:20:10] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:22:54] (03PS1) 10Thcipriani: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454894 [19:22:56] (03CR) 10Thcipriani: [C: 032] all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454894 (owner: 10Thcipriani) [19:23:40] (03PS1) 10Dzahn: quarry: update include of querkiller in killer class [puppet] - 10https://gerrit.wikimedia.org/r/454896 [19:24:41] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454894 (owner: 10Thcipriani) [19:24:47] (03PS2) 10Dzahn: quarry: update include of querykiller in killer class [puppet] - 10https://gerrit.wikimedia.org/r/454896 [19:25:38] (03PS1) 10Bstorm: nova: scale back the database and worker usage for cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/454897 (https://phabricator.wikimedia.org/T202549) [19:26:07] (03PS3) 10Dzahn: quarry: update include of querykiller in killer class [puppet] - 10https://gerrit.wikimedia.org/r/454896 [19:26:11] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.18 [19:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:40] (03CR) 10Dzahn: [C: 032] quarry: update include of querykiller in killer class [puppet] - 10https://gerrit.wikimedia.org/r/454896 (owner: 10Dzahn) [19:31:02] !log ppchelko@deploy1001 Started deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 5, onthisday is timing out. Just push this through.. [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:17] (03PS3) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:34:29] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b2b61e8]: Resurrect MCS on wikivoyage, take 5, onthisday is timing out. 
Just push this through.. (duration: 03m 27s) [19:34:30] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: revert all wikis to 1.32.0-wmf.18 [19:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:44] (03CR) 10jerkins-bot: [V: 04-1] Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:36:01] (03PS4) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:37:27] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454894 (owner: 10Thcipriani) [19:38:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:39:23] (03PS5) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:40:20] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:40:37] (03PS1) 10Dzahn: quarry: fix changed class name in celeryrunner [puppet] - 10https://gerrit.wikimedia.org/r/454898 [19:40:46] (03PS1) 10Thcipriani: Revert "all wikis to 1.32.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454899 [19:40:59] (03PS6) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) 
[19:41:49] (03CR) 10Thcipriani: [C: 032] Revert "all wikis to 1.32.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454899 (owner: 10Thcipriani) [19:42:14] (03CR) 10Dzahn: [C: 032] quarry: fix changed class name in celeryrunner [puppet] - 10https://gerrit.wikimedia.org/r/454898 (owner: 10Dzahn) [19:42:43] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12206/analytics-tool1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:42:49] (03CR) 10Ottomata: [C: 032] Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:42:59] (03PS7) 10Ottomata: Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) [19:43:12] (03CR) 10Ottomata: [V: 032 C: 032] Include yarn.wikimedia.org proxy vhost on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454892 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [19:43:33] (03Merged) 10jenkins-bot: Revert "all wikis to 1.32.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454899 (owner: 10Thcipriani) [19:50:30] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:51:17] (03PS1) 10Ottomata: Need mod_headers for yarn proxy [puppet] - 10https://gerrit.wikimedia.org/r/454901 [19:52:06] (03CR) 10Ottomata: [C: 032] Need mod_headers for yarn proxy [puppet] - 10https://gerrit.wikimedia.org/r/454901 (owner: 10Ottomata) [19:52:29] (03PS1) 10Andrew Bogott: mwopenstackclients: provide ability to enumerage all instances for all regions [puppet] - 10https://gerrit.wikimedia.org/r/454902 [19:52:52] (03PS1) 10Dzahn: quarry: fix clone_path inclusion in celeryrunner/web template [puppet] - 
10https://gerrit.wikimedia.org/r/454903 [19:53:23] (03CR) 10jenkins-bot: Revert "all wikis to 1.32.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454899 (owner: 10Thcipriani) [19:54:01] (03PS2) 10Dzahn: quarry: fix clone_path inclusion in celeryrunner/web template [puppet] - 10https://gerrit.wikimedia.org/r/454903 [19:54:18] (03CR) 10Dzahn: [V: 032 C: 032] quarry: fix clone_path inclusion in celeryrunner/web template [puppet] - 10https://gerrit.wikimedia.org/r/454903 (owner: 10Dzahn) [19:55:04] (03PS2) 10Andrew Bogott: mwopenstackclients: provide ability to enumerate all instances for all regions [puppet] - 10https://gerrit.wikimedia.org/r/454902 [19:55:47] (03PS3) 10Andrew Bogott: mwopenstackclients: provide ability to enumerate all instances for all regions [puppet] - 10https://gerrit.wikimedia.org/r/454902 [19:56:24] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients: provide ability to enumerate all instances for all regions [puppet] - 10https://gerrit.wikimedia.org/r/454902 (owner: 10Andrew Bogott) [19:57:24] 10Operations, 10Growth-Team, 10Mail, 10Notifications, 10User-herron: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (10herron) >>! In T202329#4525178, @Jdforrester-WMF wrote: > I don't know if the denominator (n... 
[20:03:40] (03PS1) 10Dzahn: quarry: add Hiera lookups for clone_path to profiles [puppet] - 10https://gerrit.wikimedia.org/r/454904 [20:04:52] (03PS2) 10Dzahn: quarry: add Hiera lookups for clone_path to profiles [puppet] - 10https://gerrit.wikimedia.org/r/454904 [20:05:05] (03CR) 10Dzahn: [C: 032] quarry: add Hiera lookups for clone_path to profiles [puppet] - 10https://gerrit.wikimedia.org/r/454904 (owner: 10Dzahn) [20:05:35] (03PS1) 10Ottomata: Rename profile::hadoop::sites::yarn to profile::hadoop::yarn_proxy [puppet] - 10https://gerrit.wikimedia.org/r/454905 (https://phabricator.wikimedia.org/T202011) [20:06:11] (03PS2) 10Ottomata: Rename profile::hadoop::sites::yarn to profile::hadoop::yarn_proxy [puppet] - 10https://gerrit.wikimedia.org/r/454905 (https://phabricator.wikimedia.org/T202011) [20:06:28] (03CR) 10Dzahn: [C: 032] "fixing https://www.irccloud.com/pastebin/TPu1pLlE/" [puppet] - 10https://gerrit.wikimedia.org/r/454904 (owner: 10Dzahn) [20:07:23] (03CR) 10Ottomata: [C: 032] "no op https://puppet-compiler.wmflabs.org/compiler02/12207/analytics-tool1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454905 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:07:33] (03PS3) 10Ottomata: Rename profile::hadoop::sites::yarn to profile::hadoop::yarn_proxy [puppet] - 10https://gerrit.wikimedia.org/r/454905 (https://phabricator.wikimedia.org/T202011) [20:07:35] (03CR) 10Ottomata: [V: 032 C: 032] Rename profile::hadoop::sites::yarn to profile::hadoop::yarn_proxy [puppet] - 10https://gerrit.wikimedia.org/r/454905 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:12:01] Is 1.32.0-wmf.18 blocked? If so, could 7857e63925a0 (Score extension) be added as an evening SWAT? 
[20:13:15] (03PS2) 10Dzahn: postgresql::slave::monitoring: make check description configurable [puppet] - 10https://gerrit.wikimedia.org/r/454730 (https://phabricator.wikimedia.org/T185504) [20:14:37] (03CR) 10Dzahn: [C: 032] postgresql::slave::monitoring: make check description configurable [puppet] - 10https://gerrit.wikimedia.org/r/454730 (https://phabricator.wikimedia.org/T185504) (owner: 10Dzahn) [20:15:45] Ebe123: wmf.18 is currently blocked, see subtasks of the train task for blockers: https://phabricator.wikimedia.org/T191064 to add a patch for evening SWAT please add it to the schedule: https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_August_23 [20:16:37] (03PS1) 10Ottomata: Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) [20:17:22] (03CR) 10jerkins-bot: [V: 04-1] Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:18:41] (03PS2) 10Ottomata: Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) [20:20:18] thcipriani: done. 
Should have done it a while back :) [20:21:37] (03PS1) 10Ottomata: Don't include turnilo proxy on thorium [puppet] - 10https://gerrit.wikimedia.org/r/454910 (https://phabricator.wikimedia.org/T202011) [20:21:55] (03CR) 10Ottomata: [V: 032 C: 032] Don't include turnilo proxy on thorium [puppet] - 10https://gerrit.wikimedia.org/r/454910 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:24:42] (03PS3) 10Ottomata: Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) [20:25:23] (03CR) 10jerkins-bot: [V: 04-1] Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:28:07] (03PS4) 10Ottomata: Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) [20:28:19] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:28:23] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/12210/analytics-tool1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:29:22] (03PS5) 10Ottomata: Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) [20:29:52] (03CR) 10Ottomata: [V: 032 C: 032] Include hue on analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/454909 (https://phabricator.wikimedia.org/T202011) (owner: 10Ottomata) [20:35:30] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Package[hue] [20:36:24] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2018), 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Johan) @mark @akosiaris So what's going out to Monday is in https://meta.wikimedia.org/wiki/Tech/News/2018/35 which links to... [20:36:50] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504 (10Dzahn) netbox related checks are now grouped together at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox [20:49:07] (03PS2) 10Krinkle: mtail: Escape the '.' in /w/load.php for varnishrls.mtail [puppet] - 10https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) [20:49:17] (03CR) 10Krinkle: "OK. I think that does the trick. – https://regexr.com/3ud1o" [puppet] - 10https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) (owner: 10Krinkle) [20:49:45] (03CR) 10jerkins-bot: [V: 04-1] mtail: Escape the '.' in /w/load.php for varnishrls.mtail [puppet] - 10https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) (owner: 10Krinkle) [20:51:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) Oo, just ran into a problem with Hue that I need a little help with from some debian packaging pros. Let's try @MoritzMueh... [20:52:01] (03CR) 10Krinkle: [C: 031] "CC-ed Tyler and Dan for input." [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [20:54:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) @Dzahn suggests: > i think.. 
the proper way would be to unpack the hue-common packags, find the "Depends" line in the cont... [20:57:45] (03PS3) 10Krinkle: mtail: Escape the '.' in /w/load.php for varnishrls.mtail [puppet] - 10https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) [20:58:26] (03CR) 10Krinkle: "Meh, mtail's parser is different. Now testing with golang's – https://regex-golang.appspot.com/assets/html/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/454724 (https://phabricator.wikimedia.org/T202479) (owner: 10Krinkle) [20:58:55] Krinkle: sorry about the mtail tests error "messages" when sth goes wrong :( not very helpful stacktrace [20:59:14] godog: He, yeah. I assumed as much [21:05:31] *nod* [21:06:04] just the test name but not much [21:14:43] PROBLEM - Hue Server on analytics-tool1001 is CRITICAL: NRPE: Command check_hue not defined [21:21:12] oh ^ i gotta ack [21:21:44] ACKNOWLEDGEMENT - Hue Server on analytics-tool1001 is CRITICAL: NRPE: Command check_hue not defined ottomata wip [21:21:46] ACKNOWLEDGEMENT - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hue] ottomata wip [21:22:19] 10Operations, 10ops-eqdfw, 10netops, 10Patch-For-Review: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10Papaul) [21:30:39] 10Operations, 10Legalpad: Update terms "Labs" and "Operations" in L3 - https://phabricator.wikimedia.org/T202617 (10MarcoAurelio) @RobH The "Scope" section at L3 still mentions Labs. Not much of an issue as the page redirects to Cloud Services Terms of Use though. Regards. [21:56:36] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Close https://lists.wikimedia.org/mailman/listinfo/cep and keep the archive for now - https://phabricator.wikimedia.org/T155683 (10Dzahn) Is this a request to rename cep@lists to commrel-support@lists? Renaming means the subscribed users, the con... 
[22:00:13] PROBLEM - swift-account-auditor on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [22:00:13] PROBLEM - swift-account-replicator on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:00:24] PROBLEM - swift-object-updater on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:00:34] PROBLEM - swift-container-auditor on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:00:43] PROBLEM - swift-object-server on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:00:43] PROBLEM - swift-object-replicator on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:00:44] PROBLEM - swift-container-server on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:00:53] PROBLEM - swift-account-server on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:00:54] PROBLEM - swift-account-reaper on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:01:03] PROBLEM - swift-object-auditor on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:01:13] PROBLEM - swift-container-updater on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:01:14] PROBLEM - swift-container-replicator on ms-be2043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:04:18] godog: there it is again [22:04:34] 
well no, it's different [22:05:10] i see the comment with puppet agent .. you are on it.. nvm [22:05:15] bbl [22:15:36] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10Thargrovewmf) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDJO7wkp5QMmHB+/GiFTHhd8pMwQ+1yUXu4jV+mBpRcnO0wPa18UPI1jkF0arwmfF1IWOcjmHH/PZT... [22:27:25] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.18/extensions/FlaggedRevs/backend/FlaggedRevs.class.php: [[gerrit:455017|Follow-up 413e11e2f: Skip CurrentRevisionCallback if not stabilized]] T202659 (duration: 00m 56s) [22:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:31] T202659: prepareForParse() nor setReviewedVersions() called yet - https://phabricator.wikimedia.org/T202659 [22:27:44] ^ RoanKattouw live now, thanks again for the fix [22:27:57] * thcipriani tries to roll-forward one last time [22:28:34] (03PS1) 10Thcipriani: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455021 [22:28:36] (03CR) 10Thcipriani: [C: 032] all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455021 (owner: 10Thcipriani) [22:29:57] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455021 (owner: 10Thcipriani) [22:31:39] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.18 [22:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:37] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10Jalexander) a:05Thargrovewmf>03None [22:34:02] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki 
entities - https://phabricator.wikimedia.org/T200297 (10awight) The meeting minutes are pretty accurate, but I wanted to collect a few more points for my own ref... [22:34:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) a:05Kbrown>03None [22:36:47] !log clear peer 80.249.208.51 AS 8708 on cr2-esams [22:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:34] 10Operations, 10Parsing-Team, 10RESTBase, 10Traffic, 10Services (next): Improve multi-content-bucket designA - https://phabricator.wikimedia.org/T202682 (10Pchelolo) p:05Triage>03Normal [22:39:50] !log deactivating down AMSIX peer (no reply to emails) [22:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:23] 10Operations, 10Parsing-Team, 10RESTBase, 10Traffic, 10Services (next): Improve Accept header normalization in VCL for REST API - https://phabricator.wikimedia.org/T202682 (10Pchelolo) [22:41:35] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455021 (owner: 10Thcipriani) [22:42:51] (03PS1) 10Zoranzoki21: Add *.aucklandmuseum.com to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455027 (https://phabricator.wikimedia.org/T202680) [22:44:17] (03PS2) 10Zoranzoki21: Add *.aucklandmuseum.com to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455027 (https://phabricator.wikimedia.org/T202680) [22:46:41] !log Add v6 session in eqsin to 10089 [22:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:14] jouncebot, refresh [22:53:15] I refreshed my knowledge about deployments. 
[22:53:18] !log Add v4 session in eqdfw to AS8075 [22:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:18] jouncebot: now [22:59:18] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [22:59:22] jouncebot: next [22:59:22] In 0 hour(s) and 0 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T2300) [22:59:28] Hi Zoranzoki21 [22:59:35] Hi [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180823T2300). [23:00:05] Zoranzoki21 and Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] Here :) [23:00:27] Here and I [23:07:24] No deployer... [23:08:31] Looks so.. But maybe will be [23:08:47] Maybe you should suggest yourself for deployer [23:09:02] I can do it [23:09:04] Good! [23:09:14] :D tnx [23:12:23] * Reedy pokes wikibugs [23:13:53] Reedy, would you mind adding other patches while you're deploying? ;) [23:13:59] Feel free [23:15:44] Y U NO SPEAK wikibugs? [23:15:53] !log reedy@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [23:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:06] !log reedy@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [23:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:35] what happening with wikibugs if is not secret? 
[23:16:39] 10Operations, 10netops: cr1/2-eqiad PFE_FW_SYSLOG_IP6_GEN log entries - https://phabricator.wikimedia.org/T201149 (10ayounsi) 05Open>03Resolved I disabled logging a couple weeks ago, so no more `PFE_FW_SYSLOG_IP6_GEN` logs. `Next-hop resolution requests from interface XXX throttled` are probably due to de... [23:16:44] He no want to wake up? [23:16:56] !log reedy@deploy1001 scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [23:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:09] That's... No good [23:17:43] Definitely unrelated [23:19:24] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202680 (duration: 00m 48s) [23:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:29] T202680: Add *.aucklandmuseum.com to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T202680 [23:20:23] Reedy, there'll be a permission change soon [23:20:48] Reedy: You already deployed my change while I checked my instagram :D [23:21:01] Zoranzoki21: Some changes don't need testing on mwdebug [23:21:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:21:16] Reedy: I know :) [23:21:57] Thanks! [23:23:55] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: T202139 (duration: 00m 48s) [23:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:00] T202139: Remove the "autoreview" user group from ru.wikisource - https://phabricator.wikimedia.org/T202139 [23:24:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:26:14] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [23:26:17] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (10ayounsi) 05stalled>03Resolved We will use them for ulsfo. If the new FS test optics (with new firmware) work, then we could consider using them in future deployments. [23:28:10] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T184943 (duration: 00m 48s) [23:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:15] T184943: Set $wgCategoryCollation = uca-tr on trwiki - https://phabricator.wikimedia.org/T184943 [23:28:42] !log running `mwscript updateCollation.php --wiki=trwiki --previous-collation=uppercase` T184943 [23:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:53] Reedy, I've uploaded the patches and updated the calendar (https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1801105&oldid=1801095) [23:33:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) [23:34:09] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202599 (duration: 00m 49s) [23:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:14] T202599: New user groups in zhwikiversity - https://phabricator.wikimedia.org/T202599 [23:34:34] 10Operations, 10netops: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) p:05Unbreak!>03Normal Lowering the priority, as as far I know this didn't happen again. Focusing on the row A/B issues. 
[23:36:51] wikibugs are online [23:37:47] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202599 (duration: 00m 53s) [23:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) @Kbrown That looks good, it matches what we had before. Since i have the patchset to create the use... [23:42:41] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T201236 (duration: 00m 55s) [23:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:46] T201236: Please add karbobala.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T201236 [23:43:19] !log created wikilove tables on zh_yuewiki T202548 [23:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:23] T202548: Enable Wikilove extension on zh_yuewiki - https://phabricator.wikimedia.org/T202548 [23:47:04] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202548 (duration: 01m 01s) [23:47:06] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10ayounsi) p:05Triage>03Normal [23:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:30] 10Operations, 10ops-eqdfw, 10netops, 10Patch-For-Review: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [23:48:38] 10Operations, 10netops, 10Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (10ayounsi) [23:48:41] 10Operations, 10ops-eqdfw, 10netops, 10Patch-For-Review: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) 05Open>03Resolved [23:49:22] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - 
https://phabricator.wikimedia.org/T202700 (10ayounsi) [23:50:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Dzahn) The new user has been created on bastion hosts but doesn't have additional groups yet. Those will ha... [23:51:40] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10ayounsi) [23:51:43] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10ayounsi) 05Open>03Resolved a:03ayounsi I believe we're good here. Please re-open if not (or need firewall policies).