[00:10:39] (PS15) EBernhardson: convert role::logstash::elasticsearch to profiles [puppet] - https://gerrit.wikimedia.org/r/441894
[00:10:41] (PS19) EBernhardson: prometheus/elasticsearch support multiple exporters per host [puppet] - https://gerrit.wikimedia.org/r/441321
[00:10:43] (PS22) EBernhardson: Split instance define out of elasticsearch class [puppet] - https://gerrit.wikimedia.org/r/441338
[00:10:45] (PS50) EBernhardson: Allow multiple elasticsearch instances per host [puppet] - https://gerrit.wikimedia.org/r/440049
[00:13:14] (PS1) Alex Monk: deployment-prep: Add new deployment host [puppet] - https://gerrit.wikimedia.org/r/442229 (https://phabricator.wikimedia.org/T192561)
[01:38:50] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.93 seconds
[01:48:51] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 45.77 seconds
[02:05:30] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={container_status,create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:06:40] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:20:36] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 08m 33s)
[02:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:35] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 15m 53s)
[02:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:07] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Jun 27 03:04:07 UTC 2018 (duration 10m 32s)
[03:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:40] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54249 MB (3% inode=99%)
[03:14:50] RECOVERY - Disk space on maps1001 is OK: DISK OK
[04:42:07] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@2207b66]: Update mobileapps to d7221ba
[04:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:49] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@2207b66]: Update mobileapps to d7221ba (duration: 05m 42s)
[04:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:44] !log Deploy schema change on db1090:3312 T191316 T192926 T89737 T195193
[05:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:48] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:08:48] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:08:48] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:08:48] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:08:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 for alter table (duration: 01m 06s)
[05:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:31] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:58:00] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:18:00] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0
[07:18:10] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0
[07:20:16] that is probably the scheduled maintenance^?
[07:30:47] jynus: right, on cr2-ulsfo, xe-1/3/0 is down and is labeled as "Transport: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO)"
[07:31:12] jynus: that matches the zayo hard down window for today
[07:31:14] uh, we lost wikibugs at Wed 02:21:58 CEST
[07:31:30] vgutierrez: yes, that is why I am worried about tomorrow
[07:31:35] jynus: https://librenms.wikimedia.org/device/device=89/tab=port/port=7518/
[07:31:54] jynus: dunno if zayo gives us 3 windows but it's actually only using one
[07:32:07] jynus: cause yesterday we didn't see any affectation on that link
[07:33:08] in any case, consider a preventive dc depool IF it would help, rather than doing an emergency one :-)
[07:33:21] sure :)
[07:34:04] legoktm: by any chance are you around? sorry to bother, you're on the contact list for wikibugs :)
[07:34:41] ACKNOWLEDGEMENT - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: Vgutierrez Zayo scheduled maintenance
[07:35:46] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: Vgutierrez Zayo scheduled maintenance
[07:37:28] !log Depool dns4001 for server restart - T198215
[07:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:31] T198215: systemd-logind fails with result 'timeout' in db2093 and dns4001 - https://phabricator.wikimedia.org/T198215
[07:38:20] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4001.wikimedia.org
[07:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:34] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4001.wikimedia.org
[07:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:38] !log Reinstall acamar & achernar as spare systems
[07:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:49] no gerrit notifications?
[08:01:35] elukey: yup. I logged that too fast :)
[08:03:08] vgutierrez: nono I meant that I just sent a code change for ops/puppet and didn't get the notification in here
[08:03:13] oh
[08:03:14] the bot might not be here
[08:03:16] wikibugs is down
[08:03:24] ah okok!
[08:03:52] elukey: see above
[08:03:55] vgutierrez: anyhow, what I am trying to do is moving the nginx submodule to environments/production/modules with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/442242/
[08:04:13] volans: yep found it thanks :)
[08:04:33] vgutierrez: then move it back to modules/nginx
[08:04:33] elukey: if you have the setup to fix it, super welcome ;)
[08:04:43] see https://www.mediawiki.org/wiki/Wikibugs
[08:24:01] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:42] that's me reinstalling acamar
[08:26:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3312 after alter table (duration: 00m 57s)
[08:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:10] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100%
[08:26:30] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[08:32:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1074 for alter table (duration: 00m 57s)
[08:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:59] !log Stop replication on db1074 to remove triggers from db1125 - T192926
[08:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:01] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:35:26] !log Deploy schema change on db1074 with replication, this will generate lag on s2 on labsdb T191316 T192926 T89737 T195193
[08:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:30] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:35:31] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:35:31] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[09:14:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1067 (duration: 00m 56s)
[09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:24] (PS1) Volans: Updated src to v0.1.5 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442255 (https://phabricator.wikimedia.org/T191299)
[09:18:26] (PS1) Volans: Built wheels for v0.1.4 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442256 (https://phabricator.wikimedia.org/T191299)
[09:18:49] (CR) Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth (1 comment) [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[09:19:02] (PS6) Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[09:19:04] (PS10) Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[09:19:16] \o/
[09:19:19] arturo: thx!
[09:19:30] oh!
[09:19:30] (CR) jerkins-bot: [V: -1] grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[09:19:35] hey wikibugs !
[09:19:47] (I ignore the bot, BTW, too verbose for me)
[09:20:00] /o\
[09:21:05] (PS2) Gehel: Fix whitespace in WDQS logging class name [puppet] - https://gerrit.wikimedia.org/r/442230 (owner: Smalyshev)
[09:21:21] (PS2) Volans: Built wheels for v0.1.5 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442256 (https://phabricator.wikimedia.org/T191299)
[09:21:37] (CR) Volans: [V: 2 C: 2] Updated src to v0.1.5 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442255 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:21:45] (CR) Gehel: [C: 2] Fix whitespace in WDQS logging class name [puppet] - https://gerrit.wikimedia.org/r/442230 (owner: Smalyshev)
[09:22:03] (CR) Volans: [V: 2 C: 2] Built wheels for v0.1.5 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442256 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:22:29] (PS7) Alexandros Kosiaris: grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150)
[09:22:31] (PS11) Alexandros Kosiaris: grafana: Enable grafana LDAP in production [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150)
[09:26:54] (PS3) Elukey: cassandra: add another package version to the 2.2 list [puppet] - https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062)
[09:27:26] (CR) jerkins-bot: [V: -1] cassandra: add another package version to the 2.2 list [puppet] - https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: Elukey)
[09:28:34] (CR) Alexandros Kosiaris: [C: 2] grafana: Add migration script from proxy to LDAP auth [puppet] - https://gerrit.wikimedia.org/r/404651 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[09:32:08] (CR) Elukey: "Pcc looks ok: https://puppet-compiler.wmflabs.org/compiler02/11588/" [puppet] - https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: Elukey)
[09:36:13] Operations, ops-codfw, Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286#4318445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['acamar.wikimedia.org'] ``` and were **ALL** successful.
[09:37:49] (PS1) Gehel: maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290)
[09:38:02] !log volans@deploy1001 Started deploy [debmonitor/deploy@052a9ea]: Release v0.1.5
[09:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:25] (CR) jerkins-bot: [V: -1] maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290) (owner: Gehel)
[09:38:27] !log volans@deploy1001 Finished deploy [debmonitor/deploy@052a9ea]: Release v0.1.5 (duration: 00m 24s)
[09:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] (PS2) Gehel: maps: isolate maps-test2003 and reimage it to stretch [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290)
[09:40:42] Operations, ops-codfw, Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286#4318483 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` achernar.wikimedia.org ``` The log can be found in `/var/log/wm...
[09:46:32] (CR) Alexandros Kosiaris: [C: 2] grafana: Enable grafana LDAP in production [puppet] - https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[09:53:41] (PS2) Jcrespo: mariadb: db1067: disable notifications and reinstall as stretch [puppet] - https://gerrit.wikimedia.org/r/442252 (https://phabricator.wikimedia.org/T197069)
[09:58:29] (CR) Gehel: "Puppet compiler looks good: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/11589/console" [puppet] - https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290) (owner: Gehel)
[10:06:13] (CR) Jcrespo: [C: 2] mariadb: db1067: disable notifications and reinstall as stretch [puppet] - https://gerrit.wikimedia.org/r/442252 (https://phabricator.wikimedia.org/T197069) (owner: Jcrespo)
[10:10:40] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/442264
[10:10:50] (PS1) Volans: CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299)
[10:11:41] !log stopping db1067 and reimage it
[10:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:46] (CR) jerkins-bot: [V: -1] CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:12:14] (CR) Vgutierrez: [C: 1] "sorry about all the noise CSP related :(" [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:12:32] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/442264
[10:13:22] (CR) Muehlenhoff: [C: 1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:13:53] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/442264 (owner: Marostegui)
[10:14:18] (CR) jerkins-bot: [V: -1] CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:15:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/442264 (owner: Marostegui)
[10:16:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1074 after alter table (duration: 00m 57s)
[10:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:23] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/442267 (https://phabricator.wikimedia.org/T191316)
[10:18:52] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/442264 (owner: Marostegui)
[10:20:59] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/442267 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:22:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/442267 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:22:22] (PS1) Urbanecm: New throttle rule for Wikimania 2018 [mediawiki-config] - https://gerrit.wikimedia.org/r/442268 (https://phabricator.wikimedia.org/T198288)
[10:24:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 for alter table (duration: 00m 56s)
[10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:09] !log Deploy schema change on db1076 T191316 T192926 T89737 T195193
[10:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:13] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[10:24:13] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[10:24:14] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[10:24:14] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[10:24:40] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/442267 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:24:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0
[10:24:42] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[10:25:42] (PS2) Volans: CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299)
[10:26:36] (CR) jerkins-bot: [V: -1] CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:27:34] (CR) Vgutierrez: [C: 1] CSP header: do not set require-sri-for (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:28:01] (CR) Volans: [V: 2 C: 2] CSP header: do not set require-sri-for [software/debmonitor] - https://gerrit.wikimedia.org/r/442265 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[10:30:06] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:30:52] (PS1) Muehlenhoff: Enable microcode for all database roles [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825)
[10:34:53] (PS1) Volans: Updated src with hotfix for CSP header [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442270
[10:34:56] (PS1) Volans: Built wheels for the hotfix of the CSP header [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442271
[10:35:49] (CR) Volans: [V: 2 C: 2] Updated src with hotfix for CSP header [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442270 (owner: Volans)
[10:35:49] (CR) Volans: [V: 2 C: 2] Built wheels for the hotfix of the CSP header [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/442271 (owner: Volans)
[10:36:07] (PS1) Alexandros Kosiaris: grafana: Fix ldap.toml permissions [puppet] - https://gerrit.wikimedia.org/r/442272 (https://phabricator.wikimedia.org/T170150)
[10:36:20] !log volans@deploy1001 Started deploy [debmonitor/deploy@9536ebf]: CSP header hotfix
[10:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:42] !log volans@deploy1001 Finished deploy [debmonitor/deploy@9536ebf]: CSP header hotfix (duration: 00m 22s)
[10:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:14] (CR) Jcrespo: [C: -1] "I thought sanitarium_multisource role didn't exist anymore?" [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:37:16] (CR) Alexandros Kosiaris: [C: 2] "The extra conn.commit() made it in by mistake, but being in the middle of a migration and for a script that is going to be removed afterwa" [puppet] - https://gerrit.wikimedia.org/r/442272 (https://phabricator.wikimedia.org/T170150) (owner: Alexandros Kosiaris)
[10:38:56] !log removing maps-test2003 from cluster for reimage - T198290
[10:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:58] T198290: reimage maps-test2003 to test migration of current style to stretch - https://phabricator.wikimedia.org/T198290
[10:40:46] (PS3) Volans: debmonitor: set user home to /nonexistent [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300)
[10:41:25] akosiaris: can I merge on puppet or do you need it during the migration?
[10:43:12] volans: sure, go for it
[10:43:14] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational
[10:43:39] thx
[10:44:14] (CR) Marostegui: "> I thought sanitarium_multisource role didn't exist anymore?" [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:46:02] (PS1) Marostegui: sanitarium_multisource.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/442274 (https://phabricator.wikimedia.org/T196527)
[10:46:39] (CR) Marostegui: [C: 2] sanitarium_multisource.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/442274 (https://phabricator.wikimedia.org/T196527) (owner: Marostegui)
[10:48:11] (CR) Marostegui: [C: 1] "I have removed sanitarium_multisource.yaml as that file was not used anymore: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/44227" [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:55:05] (PS4) Volans: debmonitor: fine-tune client user creation [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300)
[10:55:07] (PS3) Volans: debmonitor: remove CSP header now set upstream [puppet] - https://gerrit.wikimedia.org/r/442111 (https://phabricator.wikimedia.org/T191299)
[10:55:24] (PS2) Jcrespo: Enable microcode for all database roles [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:55:35] (PS3) Jcrespo: Enable microcode for all database roles [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:55:58] Operations, ops-codfw, Patch-For-Review: Decommission acamar and achernar - https://phabricator.wikimedia.org/T198286#4318753 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['achernar.wikimedia.org'] ``` and were **ALL** successful.
[10:56:42] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:56:51] (CR) Jcrespo: "Not blocking the deployment of this, but I think only labsdb would be pending, I think." [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:57:42] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational
[10:59:11] (PS5) Volans: debmonitor: fine-tune client user creation [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300)
[11:00:24] (CR) Muehlenhoff: [C: 1] debmonitor: fine-tune client user creation [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[11:00:56] (CR) Volans: [C: 2] debmonitor: fine-tune client user creation [puppet] - https://gerrit.wikimedia.org/r/442246 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[11:04:33] (PS4) Volans: debmonitor: remove CSP header now set upstream [puppet] - https://gerrit.wikimedia.org/r/442111 (https://phabricator.wikimedia.org/T191299)
[11:05:40] (CR) Volans: [C: 2] debmonitor: remove CSP header now set upstream [puppet] - https://gerrit.wikimedia.org/r/442111 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[11:07:49] (CR) Volans: [V: 2 C: 2] Upstream release v0.1.5 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/442249 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[11:10:59] (PS4) Giuseppe Lavagetto: [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126)
[11:11:22] (CR) Muehlenhoff: "Ah, right. I'll amend for labsdb*" [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[11:12:21] (CR) jerkins-bot: [V: -1] [WIP] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[11:13:41] Operations, monitoring, Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3420747 (GoranSMilovanovic) Hi, - I can see the change on https://grafana.wikimedia.org/login - However, I cannot login w. my LDAP usern...
[11:18:22] volans: am now, but looks like it's been fixed
[11:19:16] legoktm: yeah, sorry for the ping, thanks for checking
[11:23:14] (PS1) Jcrespo: Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/442275
[11:23:31] (PS2) Jcrespo: Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/442275
[11:26:52] !log Deploy schema change on s8 codfw master (db2045) with replication, this will generate lag on s8 codfw T191316 T192926 T89737 T195193
[11:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:56] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[11:26:57] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[11:26:57] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[11:26:57] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[11:28:01] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/442277
[11:29:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/442277 (owner: Marostegui)
[11:30:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/442277 (owner: Marostegui)
[11:31:23] (PS1) Muehlenhoff: Also override override_dh_auto_clean [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/442278
[11:31:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 after alter table (duration: 00m 57s)
[11:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:56] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 200 OK
[11:32:23] (CR) jerkins-bot: [V: -1] Also override override_dh_auto_clean [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/442278 (owner: Muehlenhoff)
[11:39:31] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: Elukey)
[11:40:58] (PS1) Jcrespo: mariadb: Reenable notifications on db1067 [puppet] - https://gerrit.wikimedia.org/r/442279 (https://phabricator.wikimedia.org/T197069)
[11:44:04] (PS1) Alexandros Kosiaris: grafana: Cleanup for various settings [puppet] - https://gerrit.wikimedia.org/r/442281
[11:47:58] (PS2) Alexandros Kosiaris: grafana: Cleanup for various settings [puppet] - https://gerrit.wikimedia.org/r/442281
[11:50:34] (CR) Alexandros Kosiaris: [C: 2] grafana: Cleanup for various settings [puppet] - https://gerrit.wikimedia.org/r/442281 (owner: Alexandros Kosiaris)
[11:55:08] (CR) Volans: [C: 1] "LGTM" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/442278 (owner: Muehlenhoff)
[11:55:57] Operations, monitoring, Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#4319059 (akosiaris) >>! In T170150#4318810, @GoranSMilovanovic wrote: > Hi, > > - I can see the change on https://grafana.wikimedia.org/...
[11:57:22] (CR) Muehlenhoff: [V: 2 C: 2] Also override override_dh_auto_clean [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/442278 (owner: Muehlenhoff)
[11:59:54] Operations, SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4319077 (Aklapper)
[12:05:55] (PS1) Alexandros Kosiaris: grafana: Remove reference to grafana-admin from home page [puppet] - https://gerrit.wikimedia.org/r/442284 (https://phabricator.wikimedia.org/T170150)
[12:11:29] (PS4) Muehlenhoff: Enable microcode for all database roles [puppet] - https://gerrit.wikimedia.org/r/442269 (https://phabricator.wikimedia.org/T127825)
[12:12:07] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/442277 (owner: Marostegui)
[12:18:34] (PS1) Urbanecm: Create group eventparticipant on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/442286 (https://phabricator.wikimedia.org/T198167)
[12:19:17] (PS2) Urbanecm: Create TemplateEditor group on enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056)
[12:41:42] (PS1) Rush: openstack: labnet100[34] 10g nic for install_server [puppet] - https://gerrit.wikimedia.org/r/442290 (https://phabricator.wikimedia.org/T193196)
[12:42:39] (CR) Rush: [C: 2] openstack: labnet100[34] 10g nic for install_server [puppet] - https://gerrit.wikimedia.org/r/442290 (https://phabricator.wikimedia.org/T193196) (owner: Rush)
[12:49:30] !log Upgrade librdkafka1 and restart varnishkafka-webrequest in cache::upload nodes - T182993
[12:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:32] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993
[12:49:49] Operations, ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4259081 (Cmjohnson) @bstorm Is this sever fully functional? I wanted to wait until it's working and the connectivity issues were resolved before tackling the next set of issues. Thanks!
[12:51:02] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1019 IPMI alert - https://phabricator.wikimedia.org/T196751#4319194 (Cmjohnson) Open>Resolved a:Cmjohnson This may very well have been partially unplugged during the 10G issues. Resolving the task. If it returns we can open a...
[12:52:43] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4162403 (ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l...
[12:56:58] Operations, ops-eqiad, cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4251148 (Cmjohnson) @Andrew We need thermal paste. I have created a procurement task https://phabricator.wikimedia.org/T198326. Once it arrives I will ping you regarding a good day/time to...
[12:57:25] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4319230 (MoritzMuehlenhoff) Could you open a maintenance shell and attach a screenshot of the output for "Network controller" of "lspci -v"? We need to figure out whether it misses a dri...
[12:58:15] Operations, ops-eqiad: Degraded RAID on labvirt1020 - https://phabricator.wikimedia.org/T194855#4319233 (Cmjohnson) @bstorm after reinstall please let me know if this is still an issue.
[12:59:26] Operations, ops-eqiad: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234#4193088 (Cmjohnson) @elukey is this still an issue. I do have a spare bbu I can install. If so, please let me know when you would like to schedule this to happen
[13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1300). Please do the needful.
[13:00:05] Thiemo_WMDE and dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:22] o/
[13:00:48] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319237 (chasemp) ```Attempting Boot From NIC QLogic UNDI PXE-2.1 v7.14.5 Copyright (C) 2016 QLogic Corporation Copyright (C) 1997-2000 Intel Cor...
[13:01:31] o/
[13:01:40] I can SWAT today
[13:01:51] dcausse: want to deploy your own commit?
[13:01:57] zeljkof: sure
[13:02:05] * Thiemo_WMDE is here
[13:02:14] dcausse: go ahead, I'll review Thiemo_WMDE's commit
[13:02:24] zeljkof: mine will take a long time
[13:02:34] dcausse: to deploy, to test?
[13:02:47] to test&deploy it's a 3 steps deploy
[13:02:56] in that case, I'll go first, you second :D ok?
[13:03:02] sure :) [13:03:13] ok, I'll let you know when I'm done [13:03:18] thanks [13:05:07] (03PS1) 10Rush: openstack: labnet1003 installer match labvirt1019 [puppet] - 10https://gerrit.wikimedia.org/r/442295 [13:05:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:06:29] I am around as well if needed :) [13:06:47] (03CR) 10Rush: [C: 032] openstack: labnet1003 installer match labvirt1019 [puppet] - 10https://gerrit.wikimedia.org/r/442295 (owner: 10Rush) [13:07:52] !log piwik upgraded to 3.2.1 on bohrium + started the db migration procedure (will last 2/3h probably) [13:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:20] (03PS2) 10Zfilipin: Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:08:35] (03CR) 10Zfilipin: Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:08:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:10:05] (03Merged) 10jenkins-bot: Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:10:22] (03CR) 10jenkins-bot: Change FileImporter config data location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442146 (https://phabricator.wikimedia.org/T198050) (owner: 10WMDE-Fisch) [13:11:05] (03PS23) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [13:11:07] (03PS8) 10DCausse: Add cirrussearch settings for 
wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) [13:11:09] (03PS7) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) [13:11:26] Thiemo_WMDE: 442146 is at mwdebug1002, please test and let me know if I can deploy it [13:12:29] Thiemo_WMDE: sorry, it's not there yet, will be in a minute [13:13:32] I'm afraid I don't know where this machine is. [13:13:40] Thiemo_WMDE: it's at mwdebug1002 now, for reayl [13:13:41] reall [13:13:44] argh, real [13:13:58] Thiemo_WMDE: ok, there are docs... looking... [13:14:31] Thiemo_WMDE: docs https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [13:14:40] have you done a swat deploy before? [13:15:40] I was told I need to stick around here and respond to questions. [13:15:59] The instructions you send me look like it will take me a few hours. [13:15:59] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319252 (10chasemp) trying https://gerrit.wikimedia.org/r/c/operations/puppet/+/442295 to see if there is any change [13:16:11] Thiemo_WMDE: :D [13:16:50] no, it's just a browser extension you need to install, then you enable it and select mwdebug1002 in the extension dropdown [13:17:20] then you just try to verify that the patch works, you can go to any wikimedia site and your traffic will go to mwdebug1002 [13:17:37] when done testing, disable the extension [13:18:21] 10Operations, 10ops-eqiad: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234#4319262 (10elukey) Hi @Cmjohnson! It is yes, we can try to swap it any time, just give me a 20/30mins heads up to drain the node and shut it down! [13:18:21] Thiemo_WMDE: is there anything to test with the patch? creating/editing something? 
[13:19:46] zeljkof: I'm testing, give me 2min please. (mwdebug1002 seems to be slow...) [13:20:07] Thiemo_WMDE: sure, let me know if you need more time, or if you need help [13:20:33] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319268 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l... [13:20:36] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319269 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1003.eqia... [13:22:16] zeljkof: The codepath that works with the config change in the patch works as expected when I try with mwdebug1002 enabled. [13:22:37] Thiemo_WMDE: so ok to deploy? [13:22:41] Ok [13:23:14] Thiemo_WMDE: ok, deploying [13:23:26] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l... [13:23:30] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319272 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1003.eqia... 
[13:24:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319285 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l... [13:24:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319286 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1003.eqia... [13:24:26] !log zfilipin@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:442146|Change FileImporter config data location (T198050)]] (duration: 00m 57s) [13:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:28] T198050: Change location of Config-pages in the code - https://phabricator.wikimedia.org/T198050 [13:24:59] Thiemo_WMDE: it's deployed, disable the extension and check if things still work :D [13:25:17] dcausse: swat is yours! [13:25:23] zeljkof: Confirmed. [13:25:38] Thiemo_WMDE: thanks for deploying with #releng! :) [13:25:48] Thank YOU! [13:27:02] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319292 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l... [13:27:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319293 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1003.eqia... 
[13:27:15] zeljkof: thanks, swating [13:27:44] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:28:34] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1003.eqiad.wmnet ``` The l... [13:28:59] (03Merged) 10jenkins-bot: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:29:14] (03CR) 10jenkins-bot: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:31:27] (03PS2) 10Alexandros Kosiaris: grafana: Remove reference to grafana-admin from home page [puppet] - 10https://gerrit.wikimedia.org/r/442284 (https://phabricator.wikimedia.org/T170150) [13:31:29] (03PS1) 10Alexandros Kosiaris: grafana: Test out something [puppet] - 10https://gerrit.wikimedia.org/r/442298 [13:32:03] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3420747 (10GoranSMilovanovic) @akosiaris All is superfine now. Thanks! [13:34:27] zeljkof: is it ok to sync the whole wmf-config dir if my patch has no deps between files? [13:34:49] or should I go with one deploy per file as I usually do? 
[13:35:14] dcausse: each patch should be deployed at once, that's a recent rule [13:35:33] so deploying the entire wmf-config folder is ok, and it's even fast [13:35:40] ok [13:35:43] last time I looked it wasn't a big folder anyway [13:36:01] and I have done it recently, nothing broke, nobody shouted at me :D [13:36:18] !log dcausse@deploy1001 Started scap: wmf-config Add cirrussearch settings for wikibase (1/3) [13:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:20] tldr: should be good ;) [13:38:45] it says Started l10n-update, Updating ExtensionMessages-1.32.0-wmf.8.php, Updating LocalisationCache for 1.32.0-wmf.8 using 30 thread(s) [13:38:50] is this expected? [13:39:43] dcausse: uh, not sure, it should be a normal output, what did you do? [13:40:02] scap sync wmf-config "Add cirrussearch settings for wikibase (1/3)" [13:40:44] !log uploaded debmonitor 0.1.5 to apt.wikimedia.org [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:36] this is what I do: `scap sync-file wmf-config 'message'` [13:42:00] !log dcausse@deploy1001 Finished scap: wmf-config Add cirrussearch settings for wikibase (1/3) (duration: 05m 41s) [13:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:02] dcausse: why are you using `sync` instead of `sync-file`? [13:42:14] are those the same thing? :) [13:42:30] zeljkof: because sync-file is to deploy one file no? 
[13:42:44] dcausse: no, it deploys folders too [13:42:44] https://wikitech.wikimedia.org/wiki/Scap#scap_sync [13:43:00] hm, well, at least that's how I have used it many times :D [13:43:40] dcausse: the docs might be out of date, I think we use sync-file for all these days [13:43:52] ok this is done, I'll try sync-file ./wmf-config for the next [13:44:50] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:45:17] dcausse: huh, even this page seems out of date https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment [13:45:32] :/ [13:45:51] I remember a mail saying that sync-file/sync-dir were gone and we should always use sync [13:45:56] but it's old [13:46:09] (03Merged) 10jenkins-bot: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:46:15] dcausse: could you create a task? just to make it clear what is the current convention? [13:46:20] sure [13:46:24] thanks [13:46:46] we can discuss it there, I'm really not sure if I was doing it wrong until now :D [13:46:59] 10Operations, 10ops-eqiad: anaytics1032's BBU is not working correctly - https://phabricator.wikimedia.org/T194234#4193088 (10Cmjohnson) @elukey let's do this tomorrow morning. I will ping you when I get to the data center in the morning. 
[13:48:38] (03CR) 10jenkins-bot: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:50:40] (03PS1) 10Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 [13:51:08] (03CR) 10jerkins-bot: [V: 04-1] WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (owner: 10Filippo Giunchedi) [13:51:47] (03PS1) 10DCausse: Revert "Add cirrussearch settings for wikibase (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442302 [13:51:49] (03PS2) 10Filippo Giunchedi: WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T178690) [13:51:57] my deploy failed, reverting :( [13:52:17] (03CR) 10jerkins-bot: [V: 04-1] WIP grafana: host overview dashboard as code [puppet] - 10https://gerrit.wikimedia.org/r/442301 (https://phabricator.wikimedia.org/T178690) (owner: 10Filippo Giunchedi) [13:52:26] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#4319336 (10fgiunchedi) I coded a strawman using grafanalib at https://gerrit.wikimedia.org/r/c/operations/puppet/+/442301 and looks good to me so far... 
[13:52:46] jenkins is sad [13:52:53] I see Warning: Invalid argument supplied for foreach() on mwdebug1002, will figure this out and reschedule another deploy [13:53:41] (03CR) 10DCausse: "SWAT (reverted because of errors detected on mwdebug1002)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442302 (owner: 10DCausse) [13:54:55] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442302 (owner: 10DCausse) [13:56:08] (03Merged) 10jenkins-bot: Revert "Add cirrussearch settings for wikibase (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442302 (owner: 10DCausse) [13:57:34] I'm done, reverted broken patch and cleaned up mwdebug1002 [13:57:39] !log EU swat done [13:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:06] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#4319365 (10Joe) 05Resolved>03Open [13:59:24] (03CR) 10jenkins-bot: Revert "Add cirrussearch settings for wikibase (2/3)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442302 (owner: 10DCausse) [14:00:17] PROBLEM - Disk space on bohrium is CRITICAL: DISK CRITICAL - free space: / 3758 MB (3% inode=98%) [14:00:57] elukey: ^ [14:01:02] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#4319368 (10Joe) Reopened as this is still not fixed, see https://wikitech.wikimedia.org/wiki/Incident_documentation/20180626-LoadBalancers [14:01:08] (03PS2) 10Alexandros Kosiaris: grafana: Double quote correctly ldap.toml parameters [puppet] - 10https://gerrit.wikimedia.org/r/442298 (https://phabricator.wikimedia.org/T170150) [14:02:02] (03CR) 10jerkins-bot: [V: 04-1] grafana: Double quote correctly ldap.toml parameters [puppet] - 10https://gerrit.wikimedia.org/r/442298 
(https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:02:12] zeljkof: I think you are right about sync-file, it's sync-dir that was deprecated in favor of sync-file for everything (reading some old mails) [14:02:50] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319374 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1003.eqia... [14:03:33] dcausse: ah, yes, now I remember too [14:03:47] (03PS3) 10Alexandros Kosiaris: grafana: Double quote correctly ldap.toml parameters [puppet] - 10https://gerrit.wikimedia.org/r/442298 (https://phabricator.wikimedia.org/T170150) [14:03:56] rereading https://wikitech.wikimedia.org/wiki/Scap#scap_sync it makes sense now [14:04:04] I've just run a full scap :/ [14:04:57] (03PS4) 10Alexandros Kosiaris: grafana: Double quote correctly ldap.toml parameters [puppet] - 10https://gerrit.wikimedia.org/r/442298 (https://phabricator.wikimedia.org/T170150) [14:08:48] (03CR) 10Mholloway: [C: 031] maps: isolate maps-test2003 and reimage it to stretch [puppet] - 10https://gerrit.wikimedia.org/r/442258 (https://phabricator.wikimedia.org/T198290) (owner: 10Gehel) [14:09:46] (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Remove reference to grafana-admin from home page [puppet] - 10https://gerrit.wikimedia.org/r/442284 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:09:56] (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Double quote correctly ldap.toml parameters [puppet] - 10https://gerrit.wikimedia.org/r/442298 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:11:45] (03PS1) 10Alexandros Kosiaris: Remove grafana-admin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/442306 (https://phabricator.wikimedia.org/T170150) [14:15:19] (03PS1) 10Alexandros Kosiaris: grafana: 
Also add Array to the ldap.toml.erb excludes [puppet] - 10https://gerrit.wikimedia.org/r/442308 (https://phabricator.wikimedia.org/T170150) [14:15:57] PROBLEM - Check systemd state on krypton is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:16:37] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:17:18] ignore this ^ [14:17:22] my fault, fixing [14:17:32] cleanup ended up causing an outage [14:18:07] (03CR) 10Alexandros Kosiaris: [C: 032] grafana: Also add Array to the ldap.toml.erb excludes [puppet] - 10https://gerrit.wikimedia.org/r/442308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:19:17] RECOVERY - Check systemd state on krypton is OK: OK - running: The system is fully operational [14:19:57] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 200 OK - 31353 bytes in 0.007 second response time [14:21:24] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/442279 (https://phabricator.wikimedia.org/T197069) [14:27:03] (03CR) 10Gilles: webperf: Make performance::site apache config more dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [14:27:44] (03PS1) 10Volans: debmonitor: set shell for system user [puppet] - 10https://gerrit.wikimedia.org/r/442310 (https://phabricator.wikimedia.org/T191300) [14:29:46] (03CR) 10Muehlenhoff: [C: 031] debmonitor: set shell for system user [puppet] - 10https://gerrit.wikimedia.org/r/442310 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:30:05] (03CR) 10Volans: [C: 032] debmonitor: set shell for system user [puppet] - 10https://gerrit.wikimedia.org/r/442310 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:34:46] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications 
on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/442279 (https://phabricator.wikimedia.org/T197069) (owner: 10Jcrespo) [14:34:54] (03PS3) 10Jcrespo: mariadb: Reenable notifications on db1067 [puppet] - 10https://gerrit.wikimedia.org/r/442279 (https://phabricator.wikimedia.org/T197069) [14:40:10] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442275 [14:41:39] (03PS1) 10Alexandros Kosiaris: grafana: Readd grafana-admin group as editors [puppet] - 10https://gerrit.wikimedia.org/r/442311 (https://phabricator.wikimedia.org/T170150) [14:41:41] (03PS1) 10Alexandros Kosiaris: grafana-admin: Remove from production [puppet] - 10https://gerrit.wikimedia.org/r/442312 (https://phabricator.wikimedia.org/T170150) [14:41:43] (03PS1) 10Alexandros Kosiaris: grafana: Allow skipping instantiation of grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/442313 (https://phabricator.wikimedia.org/T170150) [14:42:22] ACKNOWLEDGEMENT - grafana-admin.wikimedia.org on krypton is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 200 OK alexandros kosiaris Being deprecated and soon to be removed. T170150 [14:44:20] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442275 (owner: 10Jcrespo) [14:45:34] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442275 (owner: 10Jcrespo) [14:45:42] (03PS1) 10Rush: openstack: labnet1004 to match labnet1003 [puppet] - 10https://gerrit.wikimedia.org/r/442314 [14:46:06] (03PS2) 10Rush: openstack: labnet1004 to match labnet1003 [puppet] - 10https://gerrit.wikimedia.org/r/442314 [14:46:57] 10Operations, 10ops-eqiad: tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628#4174494 (10Cmjohnson) Is there a plan to decommission this server soon? 
[14:47:04] (03CR) 10Rush: [C: 032] openstack: labnet1004 to match labnet1003 [puppet] - 10https://gerrit.wikimedia.org/r/442314 (owner: 10Rush) [14:47:32] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4319497 (10akosiaris) Things are definitely going way better now. I only see 1 alert in the last 24 hours. ``` [2018-06-27 13:47:37] SERVICE... [14:48:11] 10Operations, 10ops-eqiad, 10Release-Engineering-Team (Watching / External): tin has a failing hdd - https://phabricator.wikimedia.org/T174449#4319499 (10Cmjohnson) 05stalled>03Resolved a:03Cmjohnson This server now has a decom task https://phabricator.wikimedia.org/T196175 [14:48:21] (03PS2) 10Alexandros Kosiaris: grafana: Readd grafana-admin group as editors [puppet] - 10https://gerrit.wikimedia.org/r/442311 (https://phabricator.wikimedia.org/T170150) [14:48:23] (03PS2) 10Alexandros Kosiaris: grafana-admin: Remove from production [puppet] - 10https://gerrit.wikimedia.org/r/442312 (https://phabricator.wikimedia.org/T170150) [14:48:25] (03PS2) 10Alexandros Kosiaris: grafana: Allow skipping instantiation of grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/442313 (https://phabricator.wikimedia.org/T170150) [14:48:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana: Readd grafana-admin group as editors [puppet] - 10https://gerrit.wikimedia.org/r/442311 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [14:49:41] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1067 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442275 (owner: 10Jcrespo) [14:51:03] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#4319508 (10Cmjohnson) Most of the servers are decommissioned. 
Are you still have problems with mw1221.eqiad.wmnet mw1222.eqiad.wmnet mw1225.eqiad.wmnet mw1226.eqiad.wmnet mw1227.eqiad.wmnet mw1229.e... [14:51:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319507 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1004.eqiad.wmnet ``` The l... [14:51:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1067 (duration: 00m 56s) [14:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone bootstrap: make seed script more c&p friendly [puppet] - 10https://gerrit.wikimedia.org/r/442316 (https://phabricator.wikimedia.org/T196633) [14:55:38] 10Operations, 10ops-eqiad, 10DC-Ops: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#4319514 (10Cmjohnson) [] dbproxy1001 Replacement task https://phabricator.wikimedia.org/T196690 [] dbproxy1002 Replacement task https://phabricator.wikimedia.org/T196690 [] dbp... [14:56:00] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: keystone bootstrap: make seed script more c&p friendly [puppet] - 10https://gerrit.wikimedia.org/r/442316 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [14:57:04] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4319526 (10Andrew) @Cmjohnson Sounds good, thanks for the update. That server seems to be holding steady for now. 
[14:58:55] (03PS8) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) [14:58:58] (03PS1) 10DCausse: Add cirrussearch settings for wikibase (1.5/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442317 (https://phabricator.wikimedia.org/T182717) [14:58:59] (03PS1) 10DCausse: Add cirrussearch settings for wikibase (2/3) (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442318 (https://phabricator.wikimedia.org/T182717) [14:59:20] RECOVERY - Disk space on bohrium is OK: DISK OK [15:01:56] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#4319546 (10MoritzMuehlenhoff) mw1221, mw1230 and mw1235 are fine, the others are still showing the mentioned symptoms. [15:05:29] (03PS1) 10Muehlenhoff: Cleanup .eggs/README.txt after build [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/442319 [15:06:27] (03CR) 10jerkins-bot: [V: 04-1] Cleanup .eggs/README.txt after build [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/442319 (owner: 10Muehlenhoff) [15:11:21] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Spam to -owner mailing lists from *@qq.com emails - https://phabricator.wikimedia.org/T189957#4319563 (10herron) 05Open>03declined >>! In T189957#4303756, @Aklapper wrote: > At least for `cep-owner@` this stopped a while ago and I don't have any suc... [15:14:49] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4319573 (10RobH) a:03Braveheart This sounds like a shell request for something in one of the analytics or statistics user groups. However, we will need a few things met before we can roll out... 
[15:14:57] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4319576 (10RobH) [15:15:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319577 (10Cmjohnson) @chasemp labnet1004 the cable in eth4 is connected the correct port and according the bios the mac address is E0:07:1B:EF:1... [15:17:02] (03CR) 10Muehlenhoff: [V: 032 C: 032] Cleanup .eggs/README.txt after build [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/442319 (owner: 10Muehlenhoff) [15:18:21] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#4319579 (10Cmjohnson) Thanks Moritz. I have a procurement task for more thermal paste. Once it arrives, we can schedule a time to take care of these. procurement task https://phabricator.wikimedia.org/... [15:20:23] (03PS1) 10Rush: openstack: update install MAC for labnet1004 [puppet] - 10https://gerrit.wikimedia.org/r/442323 [15:20:30] PROBLEM - Disk space on bohrium is CRITICAL: DISK CRITICAL - free space: / 3777 MB (3% inode=98%) [15:20:36] (03PS2) 10Rush: openstack: update install MAC for labnet1004 [puppet] - 10https://gerrit.wikimedia.org/r/442323 [15:21:24] checking bohrium [15:21:24] (03CR) 10Eevans: [C: 031] cassandra: add another package version to the 2.2 list [puppet] - 10https://gerrit.wikimedia.org/r/442251 (https://phabricator.wikimedia.org/T197062) (owner: 10Elukey) [15:21:54] (03CR) 10Rush: [C: 032] openstack: update install MAC for labnet1004 [puppet] - 10https://gerrit.wikimedia.org/r/442323 (owner: 10Rush) [15:22:40] RECOVERY - Disk space on bohrium is OK: DISK OK [15:24:12] (03PS1) 10Andrew Bogott: nova: set PYTHONIOENCODING in our env scripts [puppet] - 10https://gerrit.wikimedia.org/r/442324 [15:25:42] (03PS1) 10Muehlenhoff: Add trusty-wikimedia to known-dists [puppet] - 
10https://gerrit.wikimedia.org/r/442325 [15:25:57] (03CR) 10Arturo Borrero Gonzalez: [C: 031] nova: set PYTHONIOENCODING in our env scripts [puppet] - 10https://gerrit.wikimedia.org/r/442324 (owner: 10Andrew Bogott) [15:26:09] (03CR) 10Krinkle: webperf: Make performance::site apache config more dynamic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [15:26:25] (03CR) 10Volans: [C: 031] "LGTM thanks for fixing!" [puppet] - 10https://gerrit.wikimedia.org/r/442325 (owner: 10Muehlenhoff) [15:26:37] (03CR) 10Andrew Bogott: [C: 032] nova: set PYTHONIOENCODING in our env scripts [puppet] - 10https://gerrit.wikimedia.org/r/442324 (owner: 10Andrew Bogott) [15:27:15] thcipriani: here? saw your task for scap, I can do the upgrade now [15:27:28] godog: that'd be great! Thank you [15:28:29] thcipriani: kk, building/uploading/etc, the puppet patch is already uploaded? [15:28:48] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319603 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1004.eqiad.wmnet ``` The l... [15:28:53] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319604 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1004.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1004.eqia... 
[15:28:58] godog: puppet patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/442226 [15:29:05] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319605 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` labnet1004.eqiad.wmnet ``` The l... [15:29:46] thcipriani: kk, looks like the bug number is wrong on the patch btw [15:30:19] !log upload scap 3.8.3 - T198277 [15:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] T198277: Update Debian Package for Scap3 to 3.8.3-1 - https://phabricator.wikimedia.org/T198277 [15:30:33] (03PS2) 10Thcipriani: Scap: Bump version to 3.8.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/442226 (https://phabricator.wikimedia.org/T198277) [15:30:51] updated [15:31:01] (03PS3) 10Filippo Giunchedi: Scap: Bump version to 3.8.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/442226 (https://phabricator.wikimedia.org/T198277) (owner: 10Thcipriani) [15:31:21] (03CR) 10Filippo Giunchedi: [C: 032] Scap: Bump version to 3.8.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/442226 (https://phabricator.wikimedia.org/T198277) (owner: 10Thcipriani) [15:31:26] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) [15:32:00] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:32:02] thcipriani: thanks! 
I'll update the deploy servers [15:33:10] sounds good, this removes one available subcommand and one available option, so when that's done I'll review the help output to make sure those are gone to check that the update worked ok [15:33:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) [15:33:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:35:14] thcipriani: deploy1001 upgraded [15:36:04] !log Stop replication on db2094:3318 to update triggers on archive table [15:36:05] godog: help output looks like command was deleted. I'll go ahead and run a noop sync-file as a sanity check as well [15:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:52] thcipriani: kk, thanks! [15:37:39] (03PS2) 10Andrew Bogott: deployment-prep: Add new deployment host [puppet] - 10https://gerrit.wikimedia.org/r/442229 (https://phabricator.wikimedia.org/T192561) (owner: 10Alex Monk) [15:38:10] !log thcipriani@deploy1001 Synchronized README: Scap 3.8.3-1 noop test sync-file (duration: 00m 56s) [15:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:23] 10Operations, 10Scap (Scap3-MediaWiki-MVP), 10Wikimedia-Incident: Scap sync --restart not working - https://phabricator.wikimedia.org/T198185#4319638 (10fgiunchedi) [15:38:29] 10Operations, 10Scap, 10Patch-For-Review, 10Wikimedia-Incident: Update Debian Package for Scap3 to 3.8.3-1 - https://phabricator.wikimedia.org/T198277#4319635 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Done! 
[15:38:33] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) [15:38:40] godog: all looks good, thanks for the quick turnaround! [15:38:46] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Add new deployment host [puppet] - 10https://gerrit.wikimedia.org/r/442229 (https://phabricator.wikimedia.org/T192561) (owner: 10Alex Monk) [15:40:20] PROBLEM - Disk space on bohrium is CRITICAL: DISK CRITICAL - free space: / 3640 MB (3% inode=98%) [15:40:31] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban), 10Wikimedia-Incident: Scap sync --restart not working - https://phabricator.wikimedia.org/T198185#4319649 (10thcipriani) 05Open>03Resolved a:03thcipriani Removed in Scap 3.8.3-1 which was just made live in production. [15:42:29] elukey: bohrium is expected I guess due to the upgrade [15:42:50] can be silenced even (?) [15:43:20] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) [15:43:38] 10Operations, 10fundraising-tech-ops, 10netops: new pfw policy for monitor server - https://phabricator.wikimedia.org/T198237#4319659 (10ayounsi) a:03cwdent That config adds the two policies `prometheus2_node_exporters` and `prometheus2_misc` after the global `deny_and_log`. They need to be moved before. 
[15:43:41] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] "Compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:45:02] godog: ouch I thought it was ok, lemme check it [15:45:08] (03PS5) 10Arturo Borrero Gonzalez: openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) [15:45:10] (03CR) 10Rush: [C: 031] "I don't oppose but it does come with teh comedy of if 1004 is gone 1005 has nothing to connect to :D but yes all for testing/staging tx " [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:45:51] (03CR) 10Rush: [C: 031] "replace 1004/1005 in previous attempt at humor with 1003/1004" [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:46:06] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] openstack: eqiad1: allow mysql connection for keystone [puppet] - 10https://gerrit.wikimedia.org/r/442328 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:53:02] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1020 - https://phabricator.wikimedia.org/T194855#4319679 (10Bstorm) This is currently still some kind of an issue on both servers. The thing is that I'm not sure if it is a problem or just describing reality (embedded controller has no disk and installed cont... [15:54:51] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [15:56:06] godog: FYI ^^^ [15:58:35] 10Operations, 10Proton, 10SRE-Access-Requests, 10Patch-For-Review: Add @pmiazga @Niedzielski and @phuedx to the deploy-service group - https://phabricator.wikimedia.org/T197857#4319690 (10faidon) 05Open>03Resolved a:03faidon Sure, that's fine :) [15:58:37] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4319693 (10faidon) [16:00:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: fix keystone local db ferm rul [puppet] - 10https://gerrit.wikimedia.org/r/442332 (https://phabricator.wikimedia.org/T196633) [16:00:52] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: eqiad1: fix keystone local db ferm rul [puppet] - 10https://gerrit.wikimedia.org/r/442332 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [16:01:11] RECOVERY - Disk space on bohrium is OK: DISK OK [16:04:25] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4319707 (10ayounsi) eth1 was in the wrong vlan: ```lang=diff [edit interfaces interface-range cloud-instance-ports] member xe-4/0/33... [16:09:10] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4319724 (10Bstorm) 05Open>03Resolved Looking good! The VM is doing a puppet run. I think the network is working on these things now. [16:12:23] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4319726 (10Bstorm) This server appears to be fully functional from all views I can see. However, the monitor for RAID would disagree and think it is critical. I believe it reports that there are no drives on o... 
[16:18:00] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4319730 (10Bstorm) From https://h20195.www2.hpe.com/v2/getpdf.aspx/c04346301.pdf?ver=2 > QuickSpecs > HP Smart ArrayP840 > Controller > ... > NOTE: > HP Smart ArrayP840/4GB FBWC controller option kit **doe... [16:20:21] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:24:55] (03PS7) 10Hagar Shilo: CORS whitelist chapter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165) [16:30:59] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: fix value of nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/442340 (https://phabricator.wikimedia.org/T196633) [16:33:19] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: openstack: eqiad1: fix value of nova_controller_standby [puppet] - 10https://gerrit.wikimedia.org/r/442340 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [16:33:27] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319767 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1004.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1004.eqia... [16:34:08] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319770 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labnet1004.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['labnet1004.eqia... 
[16:37:21] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4319773 (10mobrovac) I will deploy the new fixes that got merged in the source repo today, and then tomorrow we could put it behind LVS. [16:39:23] !log Deploy schema change on dbstore1002:s8 T191316 T192926 T89737 T195193 [16:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:28] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [16:39:28] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [16:39:28] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [16:39:28] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [16:46:13] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507#4319790 (10Bstorm) @RobH Any thoughts on that battery issue above? I'm going to see if the first controller that isn't being used can be disabled in the BIOS or something. [16:50:32] (03PS1) 10Urbanecm: Enable SandboxLink on eswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442345 (https://phabricator.wikimedia.org/T198335) [16:54:45] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4319810 (10chasemp) We moved past the DHCP/NIC issue and now are failing with ```Loading Linux 4.9.0-0.bpo.6-amd64 ... Loading initial ramdisk ...... [17:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:06:56] is there a SWAT window right now? [17:07:20] there is a patch we would like to add and SWAT it right now if possible (sorry for being soo late) [17:07:34] late arrival https://wikitech.wikimedia.org/wiki/Deployments#Week_of_June_25th [17:09:17] addshore, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, zeljkof? whats the protocol? [17:25:13] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4319891 (10Mholloway) [17:34:25] " Time to snap out of that daydream"😂 [17:37:22] (03PS4) 10Alex Monk: deployment-prep-logstash2: replace deployment-tin server [puppet] - 10https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: 10Dzahn) [17:38:14] (03CR) 10Alex Monk: "Copied file across, fixing commit message" [puppet] - 10https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: 10Dzahn) [17:38:59] (03PS5) 10Alex Monk: deployment-prep logstash: replace deployment-tin reference [puppet] - 10https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: 10Dzahn) [17:39:13] (03CR) 10Alex Monk: [C: 031] "go for it" [puppet] - 10https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: 10Dzahn) [17:47:20] (03PS1) 10Valerie: Turning on page creation log for all other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) [17:48:55] (03PS2) 10Valerie: Turning on page creation log for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) [17:52:33] !log mobrovac@deploy1001 Started deploy [proton/deploy@cd6ed94]: Update proton to 491e966 - T186748 T197856 [17:52:36] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:37] T197856: Proton should reject erroneous requests straightaway - https://phabricator.wikimedia.org/T197856 [17:52:37] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [17:53:08] !log mobrovac@deploy1001 Finished deploy [proton/deploy@cd6ed94]: Update proton to 491e966 - T186748 T197856 (duration: 00m 35s) [17:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:18] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4319971 (10mobrovac) [17:56:02] (03CR) 10Imarlier: "Just a general question: what's the value of this? Beta site? Something else?" [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [17:56:24] raynor: just saw this, sorry, too late for me, I'm usually around for EU SWAT [17:56:38] nah, no worries, we will do it tomorrow [17:57:02] I could do it by myself but tbh I was bit afraid that if something goes south, I won't be able to quickly fix it [17:57:46] !log updating NTP servers on network devices [17:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1800) [18:15:50] 10Operations, 10RESTBase, 10Traffic, 10Patch-For-Review, 10Services (later): Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#4320004 (10Mholloway) [18:17:12] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4320005 (10RobH) I'm also not 100% sure this is for shell access, since I'm not clear on exactly what data (and where it is housed) that @braveheart is 
requesting. This could be some web interf... [18:20:37] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4320008 (10MarcoAurelio) Sounds sort of , but not sure; hence not adding #sre-access-requests instead of #wmf-nda-requests. [18:23:24] 10Operations, 10SRE-Access-Requests, 10netops: Get Papaul access to network equipment - https://phabricator.wikimedia.org/T198344#4320010 (10faidon) p:05Triage>03Normal [18:24:17] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3420747 (10Quiddity) Re: announcement email of completion - https://lists.wikimedia.org/pipermail/wikitech-l/2018-June/090251.html Are all... [18:25:27] (03CR) 10Krinkle: "Yeah, mainly so that we can inject these through Hiera so that XHGui can work in beta using the same config. A number of other puppet modu" [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:26:31] (03CR) 10Krinkle: "As for why I'm doing it now in particular, that's so that the mwlog1001 backend which currently serves xenon data, can be changed to perf2" [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:32:00] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.50 seconds [18:32:01] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.62 seconds [18:44:38] (03PS4) 10MarcoAurelio: Create site striker.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/441817 (https://phabricator.wikimedia.org/T189637) [18:45:28] jouncebot: next [18:45:28] In 0 hour(s) and 14 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1900) [18:45:34] jouncebot: now [18:45:34] For the next 0 hour(s) and 
14 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1800) [18:49:55] (03CR) 10MarcoAurelio: [C: 031] "I've scheduled this patch for today's evening SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441422 (https://phabricator.wikimedia.org/T195675) (owner: 10C. Scott Ananian) [19:00:04] marxarelli: Your horoscope predicts another unfortunate MediaWiki train deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T1900). [19:02:00] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 21.13 seconds [19:02:01] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 18.08 seconds [19:02:15] > may zuul be nice with you <-- nice, afaics we have some issues? [19:09:22] (03PS1) 10Dduvall: group1 wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442368 [19:09:24] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442368 (owner: 10Dduvall) [19:11:15] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442368 (owner: 10Dduvall) [19:11:48] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442368 (owner: 10Dduvall) [19:12:24] ^ thcipriani: deploy script seems to have changed the php symlink [19:12:39] that's correct for group1 [19:12:47] ah, good [19:12:53] :) [19:12:58] i was worried it was another bug on account of the wmf.999 branch [19:13:05] right [19:13:33] we should file a task for the symlink swap yesterday so we don't forget to investigate that [19:14:04] the symlink should be swapped during group1 and the deploy-promote script will do: update-wikiversions + sync the new symlink [19:14:23] oh, right. 
ok, the sync makes more sense now [19:15:05] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.10 [19:15:13] * marxarelli rolls on [19:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:04] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.10 (duration: 00m 58s) [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:12] !log seeing rising "Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.10/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema update..." errors [19:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:19] !log errors seem due to "INSERT INTO `revision_comment_temp`" statements and lock wait timeout [19:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:20] o/ [19:23:27] marxarelli: issue with the train? [19:23:37] addshore: howdy! [19:23:40] yeah, seems so [19:23:48] Is there a ticket? [19:23:57] not yet. i'm about to file it [19:24:16] ack :) [19:29:05] addshore: https://phabricator.wikimedia.org/T198350 [19:32:05] (03CR) 10Imarlier: [C: 031] "> As for why I'm doing it now in particular, that's so that the" [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [19:32:57] (03CR) 10Imarlier: [C: 031] "Also: dzahn is on vacation. He suggested Alex, Ariel, and Giuseppe as good alternatives." [puppet] - 10https://gerrit.wikimedia.org/r/442232 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [19:37:08] addshore: is this mcr-related afayct? [19:37:11] addshore: does that make sense to you? 
[19:38:04] seems to be a lot of these errors [19:38:19] my concern as well [19:38:23] definitely a lot for group1 [19:39:09] revision_comment_temp is technically not MCR, but rather from the work to make edit summaries (comments) longer than 255 bytes [19:39:36] but that was added a year ago in 11cf01dd9a8512ad4d9bded43cf22ebd38af8818. so the real culprit must be something else [19:40:17] several users have already reported it on phab and on #wikimedia-tech. it might be worth reverting the train [19:40:28] any new indices added to that table? [19:40:50] seems like a lock wait timeout on insert would be due to lock contention (slower inserts) maybe? [19:41:45] revert seems wise at this point [19:41:46] * marxarelli does [19:44:58] o/ thcipriani marxarelli, sorry just popped in while eating dinner to see how the train was going, went back to eating dinner then as the explosion didn't look too big [19:45:35] It's probably related to the MCR patches, they moved logic around which will have included come of the comment insertion stuff, would have to have a look at the stack trace after [19:45:43] probably worth reverting for now I guess if it is causing issues [19:46:44] MatmaRex: yes, the culprit is probably the alterations in RevisionStore for MCR [19:49:11] (03PS1) 10Dduvall: Revert "group1 wikis to 1.32.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442373 [19:49:17] (any idea why i can't lookup the error id from https://phabricator.wikimedia.org/T198353 in logstash? it's probably the same issue as well, but i'd like to verify before i dupe it) [19:50:09] (03CR) 10Rush: [C: 031] nova: move glance_host into hiera so it can be configured per-deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott) [19:51:19] MatmaRex: that may have to do with the url i posted. 
i'm not entirely sure how to properly share a single document from kibana [19:51:32] (03CR) 10Dduvall: [C: 032] Revert "group1 wikis to 1.32.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442373 (owner: 10Dduvall) [19:52:12] !log Rolling back group1 due to rise in error rate (T198350) [19:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:15] T198350: Rising lock wait timeout SQL errors upon 1.32.0-wmf.10 group1 deployment - https://phabricator.wikimedia.org/T198350 [19:52:20] (03CR) 10Imarlier: [C: 031] "Aaron, thoughts on Timo's last comment? Do we need to change naming around before merging this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440469 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [19:52:42] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442373 (owner: 10Dduvall) [19:52:58] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.32.0-wmf.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442373 (owner: 10Dduvall) [19:53:56] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4320260 (10ayounsi) [19:54:21] https://phabricator.wikimedia.org/T198350#4320258 <- this is annoying [19:54:36] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Group1 rolled back to 1.32.0-wmf.8 [19:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:50] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 58 probes of 323 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:57:03] I can't edit Commons now... what's going on? [19:59:03] hmmm, not seeing a drop in those errors in logstash marxarelli [19:59:08] unless its just slowq [20:00:04] addshore: strange... 
the rollback appears to have been successful on my end [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T2000). [20:00:50] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 323 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:01:08] marxarelli: it looks like it is just the same errors but happening on .8 ..... [20:01:19] [{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema [20:01:36] since 19:55 which was the rollback time [20:02:00] looks like as of 20:00 the rate might have decreased a bit, *continues watching* [20:02:06] !log applied fix for T197447 to eqiad wdqs cluster, which involved restart of the services [20:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:08] T197447: Default Blazegraph configuration confuses strings with and without RTL mark - https://phabricator.wikimedia.org/T197447 [20:02:25] yannf: it's broken, we don't know why yet. folks are working on it [20:02:28] !log dduvall@deploy1001 Synchronized php: (no justification provided) (duration: 00m 57s) [20:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:53] addshore: if there were slow queries already queued, is it possible there would be a delay in recovery due to lock contention? [20:04:35] marxarelli: could make sense! [20:04:42] https://usercontent.irccloud-cdn.com/file/432dWV56/image.png [20:04:42] looks like they have dropped off :) [20:05:02] oh good! :) [20:05:24] i already have a "i broke Wikipedia" tshirt. 
i don't need a closet full of them [20:05:26] :D [20:06:52] 10Operations, 10fundraising-tech-ops, 10netops: new pfw policy for monitor server - https://phabricator.wikimedia.org/T198237#4320365 (10cwdent) 05Open>03Resolved works! [20:08:22] https://commons.wikimedia.org/wiki/File:Marcelle_Lender,_par_Jean_Reutlinger,_btv1b85969082-p013.jpg [20:09:36] 6,400 × 5,853 pixels, while the real size is 6,400 × 2,877 pixels ^ [20:09:53] https://upload.wikimedia.org/wikipedia/commons/f/f6/Marcelle_Lender%2C_par_Jean_Reutlinger%2C_btv1b85969082-p013.jpg [20:10:32] and the original file disappeared [20:11:35] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055#4310350 (10Smalyshev) I suspect that was the cause of 500's? Do we have them anymore? If not, we can resolve this. [20:15:23] yannf: that looks like a separate issue from the outage, but seems similar to https://phabricator.wikimedia.org/T198177 , can you comment there about it? [20:16:59] (03PS6) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) [20:17:43] (03CR) 10jerkins-bot: [V: 04-1] Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: 10Smalyshev) [20:17:43] done [20:18:51] I reimported the original from the source, so we can see the issue with the first version https://commons.wikimedia.org/wiki/File:Marcelle_Lender,_par_Jean_Reutlinger,_btv1b85969082-p013.jpg [20:33:04] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4320444 (10Halfak) I'd recommend working with the `recentchanges` to geotag editors unless #analytics has a reporting UI that will work for @Braveheart. 
[20:33:05] (03PS1) 10Rush: openstack: basic net role to labnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/442737 (https://phabricator.wikimedia.org/T196633) [20:33:23] (03PS1) 10Jdlrobson: Limit wgMathEnableWikibaseDataType to wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442738 (https://phabricator.wikimedia.org/T173949) [20:34:42] (03CR) 10Rush: [C: 032] openstack: basic net role to labnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/442737 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [20:35:20] 10Operations, 10SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4320464 (10Halfak) Word on the IRC is that @milimetric is working on something for this right now :) [20:36:49] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4320479 (10chasemp) 05Open>03Resolved dug up an old task that said rootdelay is the way to address this in jessie, with permanent fixes having l... [20:58:42] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#4320520 (10akosiaris) 05Open>03stalled >>! In T170150#4320025, @Quiddity wrote: > Re: announcement email of completion - https://lists.... [21:02:23] thcipriani, addshore: thanks for your eyes on the train today. 
since the window has ended, i'll plan on sending an email about the rollback to wikitech/engineering shortly [21:02:30] Operations, netops: Rack/cable/configure mr1-eqiad - https://phabricator.wikimedia.org/T187820#4320527 (ayounsi) [21:02:33] Operations, netops: Rack/cable/configure mr1-eqiad - https://phabricator.wikimedia.org/T187820#3986943 (ayounsi) [21:02:36] Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#4320530 (ayounsi) [21:02:39] Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (ayounsi) [21:02:49] this seems like an issue we'd want dba eyes on as well, but i'm not sure who's around at the moment [21:03:15] marxarelli: ack! [21:04:04] Operations, ops-eqiad, netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171#3908273 (ayounsi) a:ayounsi>Cmjohnson Assigning to Chris for the wipe/unrack/decom/etc. [21:04:20] (PS2) Andrew Bogott: nova: move glance_host into hiera so it can be configured per-deploy [puppet] - https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791) [21:04:39] (CR) Aaron Schulz: "I don't want to yak shave too much here. The memcached-pecl name will go away once it's not used anymore. If anyone wants to just rename i" [mediawiki-config] - https://gerrit.wikimedia.org/r/440469 (https://phabricator.wikimedia.org/T198239) (owner: Aaron Schulz) [21:05:14] (CR) Andrew Bogott: [C: 2] nova: move glance_host into hiera so it can be configured per-deploy [puppet] - https://gerrit.wikimedia.org/r/440147 (https://phabricator.wikimedia.org/T191791) (owner: Andrew Bogott) [21:06:37] marxarelli: sounds good! [21:06:55] Operations, SRE-Access-Requests, netops: Get Papaul access to network equipment - https://phabricator.wikimedia.org/T198344#4320539 (ayounsi) a:ayounsi Taking the task for the actual account creation.
[21:10:41] https://commons.wikimedia.org/wiki/File:Vmmlogo.png [21:10:52] now we have files not mentioned in the uploader's log :(( ^ [21:12:37] Operations, fundraising-tech-ops, netops: new pfw policy for monitor server - https://phabricator.wikimedia.org/T198237#4320545 (ayounsi) For the record, the issue was that I was doing a `load merge` instead of a `load replace` [21:30:26] jouncebot: next [21:30:26] In 1 hour(s) and 29 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T2300) [21:37:42] * chasemp waves to marxarelli [21:38:22] Operations, SRE-Access-Requests: WMF-NDA-Request for User:Braveheart - https://phabricator.wikimedia.org/T198190#4314835 (Nuria) I do not think you need access to data but rather to the geoeditor reports (aggregated counts of edits per country per wiki) so no ssh keys should be needed, in fact besides t... [21:42:27] !log setting BFD of the Zayo eqiad-codfw link to standard of 300 [21:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:23] !log piwik maintenance on bohrium completed [21:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:48] (CR) 20after4: [C: 1] Scap clean: remove remote cache directory [mediawiki-config] - https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: Thcipriani) [22:31:19] * marxarelli waves back at chasemp :) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180627T2300). Please do the needful. [23:00:04] kaldari and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] here [23:03:18] I can SWAT [23:04:08] (PS3) Thcipriani: Turning on page creation log for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) (owner: Valerie) [23:04:32] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) (owner: Valerie) [23:05:54] I have a lot of patches and wouldn't mind deploying them myself [23:06:44] (Merged) jenkins-bot: Turning on page creation log for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) (owner: Valerie) [23:06:48] RoanKattouw: okie doke, I'll get kaldari squared away and then get out of your way :) [23:06:56] mine should be fast [23:07:21] kaldari: your change is on mwdebug1002, check please [23:07:26] looking [23:08:22] (CR) jenkins-bot: Turning on page creation log for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/442356 (https://phabricator.wikimedia.org/T196400) (owner: Valerie) [23:08:43] thcipriani: Looks good. Feel free to sync! [23:08:50] syncing [23:10:02] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:442356|Turning on page creation log for most wikis]] T196400 (duration: 00m 58s) [23:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:04] T196400: Deploy new page creation log - https://phabricator.wikimedia.org/T196400 [23:10:11] ^ kaldari should be live everywhere [23:10:15] looking... [23:11:08] works! Thanks! https://en.wikipedia.org/wiki/Special:Log/create [23:11:22] cool, thanks for checking! [23:11:26] RoanKattouw: deployment server is all yours! [23:14:52] Thanks!
[23:31:21] ...still waiting for Jenkins :| [23:31:36] thcipriani: What are the chances of this week's train being completed some time this week or next? [23:33:22] unsure at this point. There's only one blocker at this point, so it mostly depends on how deep that problem runs. [23:38:47] still possible but outlook unknowable at this point :) [23:57:05] !log phabricator deployment is coming up in just a couple of minutes. There will be downtime while I run database migrations. [23:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:33] maintenance is scheduled in icinga, this shouldn't take too long