[00:28:21] PROBLEM - Host db2064 is DOWN: PING CRITICAL - Packet loss = 100%
[01:16:42] RECOVERY - Memory correctable errors -EDAC- on rdb2002 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=rdb2002&var-datasource=codfw%2520prometheus%252Fops
[01:18:24] (PS1) Ottomata: Increase Kafka MirrorMaker max.request.size to 5.5Mb [puppet] - https://gerrit.wikimedia.org/r/434290 (https://phabricator.wikimedia.org/T189464)
[01:19:34] (CR) Ottomata: [C: 2] Increase Kafka MirrorMaker max.request.size to 5.5Mb [puppet] - https://gerrit.wikimedia.org/r/434290 (https://phabricator.wikimedia.org/T189464) (owner: Ottomata)
[01:20:52] !log bouncing main -> jumbo MirrorMaker with increased max.request.size - T189464
[01:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:57] T189464: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464
[02:41:32] PROBLEM - Long running screen/tmux on snapshot1001 is CRITICAL: CRIT: Long running SCREEN process. (user: springle PID: 20976, 1735333s 1728000s).
[03:08:26] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.3) (duration: 14m 23s)
[03:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:23] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.4) (duration: 15m 28s)
[04:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:14:03] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 21 04:14:03 UTC 2018 (duration 7m 40s)
[04:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:13] (PS1) Marostegui: db-codfw.php: Depool db2064 [mediawiki-config] - https://gerrit.wikimedia.org/r/434295 (https://phabricator.wikimedia.org/T195228)
[05:07:28] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2064 [mediawiki-config] - https://gerrit.wikimedia.org/r/434295 (https://phabricator.wikimedia.org/T195228) (owner: Marostegui)
[05:08:57] (Merged) jenkins-bot: db-codfw.php: Depool db2064 [mediawiki-config] - https://gerrit.wikimedia.org/r/434295 (https://phabricator.wikimedia.org/T195228) (owner: Marostegui)
[05:09:39] !log Deploy schema change on s7 primary master (db1062) - T191519 T188299 T1901482
[05:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:45] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:09:46] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:09:50] (CR) jenkins-bot: db-codfw.php: Depool db2064 [mediawiki-config] - https://gerrit.wikimedia.org/r/434295 (https://phabricator.wikimedia.org/T195228) (owner: Marostegui)
[05:10:41] Operations, ops-codfw, DBA, Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4220110 (Marostegui)
[05:11:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2064 - T195228 (duration: 01m 44s)
[05:11:43] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:44] T195228: db2064 crashed - https://phabricator.wikimedia.org/T195228
[05:23:16] Operations, ops-codfw, DBA, Patch-For-Review: db2064 crashed - https://phabricator.wikimedia.org/T195228#4220115 (Marostegui) a:Papaul Can you take a look at this server? Maybe power drain it? I am not even able to power it on: ``` hpiLO-> power status=0 status_tag=COMMAND COMPLETED Mon...
[05:24:19] (PS1) Marostegui: db2064: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/434297 (https://phabricator.wikimedia.org/T195228)
[05:25:05] (CR) Marostegui: [C: 2] db2064: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/434297 (https://phabricator.wikimedia.org/T195228) (owner: Marostegui)
[05:27:08] !log Stop MySQL and reboot db1067 - T194852
[05:27:09] Operations, ops-eqiad, DBA, Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4220120 (Marostegui) After the weekend, everything looks fine: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature...
[05:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:12] T194852: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852
[05:35:44] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/434298 (https://phabricator.wikimedia.org/T190148)
[05:37:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/434298 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:38:14] Operations, ops-eqiad, DBA, Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4220125 (Marostegui) After the reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature : OK ```
[05:39:00] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/434298 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:39:35] (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/434298 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:40:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1123 for alter table (duration: 01m 20s)
[05:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:56] !log Deploy schema change on db1123 - T191519 T188299 T1901482
[05:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:01] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:41:01] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:48:01] (PS1) Marostegui: db1067.yaml: Enable notifications on db1067 [puppet] - https://gerrit.wikimedia.org/r/434299 (https://phabricator.wikimedia.org/T194852)
[05:49:00] (CR) Marostegui: [C: 2]
db1067.yaml: Enable notifications on db1067 [puppet] - https://gerrit.wikimedia.org/r/434299 (https://phabricator.wikimedia.org/T194852) (owner: Marostegui)
[05:50:30] !log Restart MySQL on db1116 and db1120 for testing
[05:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:30] !log Restart MySQL on db2075 and db2092 for testing
[05:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:07] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/keystone/logging.conf]
[06:31:06] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:31:06] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:32:06] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger]
[06:32:07] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:41:17] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:42:26] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:54:47] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:56:57] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:26:36] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.119 second response time
[07:39:09] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/434300
[07:51:50] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/434300 (owner: Marostegui)
[07:53:09] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/434300 (owner: Marostegui)
[07:53:23] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/434300 (owner: Marostegui)
[07:54:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1123 after alter table (duration: 01m 21s)
[07:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:21] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/434302 (https://phabricator.wikimedia.org/T190148)
[07:58:56] PROBLEM - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy
WriteBack, currently using: WriteThrough
[07:59:23] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/434302 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:00:45] (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/434302 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:01:00] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/434302 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[08:02:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 for alter table (duration: 01m 22s)
[08:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:38] !log Deploy schema change on db1077 with replication, this will generate lags on labs - T191519 T188299 T1901482
[08:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:43] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[08:02:43] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[08:03:26] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.135 second response time
[08:14:38] (PS3) Elukey: Swap zookeeper on conf1002 with conf1005 [puppet] - https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924)
[08:17:02] Operations, ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4167825 (faidon) The RAID still shows as degraded -- @RobH -or someone else- could you have a look? Thanks!
[08:18:35] (CR) Kaldari: [C: 1] Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[08:20:00] (CR) Elukey: "new pcc https://puppet-compiler.wmflabs.org/compiler02/11255/" [puppet] - https://gerrit.wikimedia.org/r/433322 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[08:32:06] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1977 bytes in 0.089 second response time
[08:37:17] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1942 bytes in 0.098 second response time
[08:53:26] (PS1) Jcrespo: mariadb: Pool db1105 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434303
[08:57:32] (PS3) Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703
[08:58:48] (CR) Jcrespo: [C: 2] mariadb: Pool db1105 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434303 (owner: Jcrespo)
[09:00:12] (Merged) jenkins-bot: mariadb: Pool db1105 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434303 (owner: Jcrespo)
[09:00:27] (CR) jenkins-bot: mariadb: Pool db1105 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434303 (owner: Jcrespo)
[09:00:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1980 bytes in 0.108 second response time
[09:05:25] (PS4) Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703
[09:09:15] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1085 for
maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703 (owner: Jcrespo)
[09:10:43] (Merged) jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703 (owner: Jcrespo)
[09:12:45] (CR) jenkins-bot: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703 (owner: Jcrespo)
[09:14:31] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 with full weight, repool db1105 with low weight (duration: 01m 21s)
[09:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:15] (CR) Thiemo Kreuz (WMDE): [C: 1] Disable wikidiff2 inline moved paragraphs by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/434158 (https://phabricator.wikimedia.org/T194271) (owner: WMDE-Fisch)
[09:25:47] (PS1) Jcrespo: mariadb: Repool db1105:s1 fully [mediawiki-config] - https://gerrit.wikimedia.org/r/434308
[09:33:31] Operations, Analytics, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#4220461 (elukey)
[09:34:26] Operations, Analytics, Analytics-Cluster, Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#4220464 (elukey) Open>declined
[09:51:35] (CR) Jcrespo: [C: 2] mariadb: Repool db1105:s1 fully [mediawiki-config] - https://gerrit.wikimedia.org/r/434308 (owner: Jcrespo)
[09:51:42] (CR) ArielGlenn: [C: 1] "Snapshot hosts are done.
There's one lab instance to worry about, striker-deploy03.striker.eqiad.wmnet and I've left a message for the pro" [puppet] - https://gerrit.wikimedia.org/r/430912 (owner: Muehlenhoff)
[09:52:58] (Merged) jenkins-bot: mariadb: Repool db1105:s1 fully [mediawiki-config] - https://gerrit.wikimedia.org/r/434308 (owner: Jcrespo)
[09:53:13] (CR) jenkins-bot: mariadb: Repool db1105:s1 fully [mediawiki-config] - https://gerrit.wikimedia.org/r/434308 (owner: Jcrespo)
[09:57:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1105 with full weight (duration: 01m 21s)
[09:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:36] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/434310 (https://phabricator.wikimedia.org/T128546)
[10:06:56] PROBLEM - MariaDB Slave Lag: s3 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 797.24 seconds
[10:07:55] ^ checking
[10:08:26] expected, as it is the alter table
[10:09:31] (CR) Elukey: "First pass of the code looks awesome, thanks a lot for doing this work.
As far as I understood this new code can be used with Varnish 5.1," (2 comments) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/430069 (owner: R4q3NWnUx2CEhVyr)
[10:11:17] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/434310 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:12:42] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/434310 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:12:56] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/434310 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:13:23] Operations, Cloud-VPS: Cannot add or update records under DNS zones in Horizon - https://phabricator.wikimedia.org/T195059#4220510 (ArielGlenn) p:Triage>Normal
[10:14:16] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.107 second response time
[10:15:00] Operations, Discovery, Maps: Track more detailed disk usage - https://phabricator.wikimedia.org/T194997#4220513 (ArielGlenn) p:Triage>Normal
[10:15:20] (PS1) Mobrovac: Proton: Apply the role to proton hosts [puppet] - https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748)
[10:15:52] (CR) jerkins-bot: [V: -1] Proton: Apply the role to proton hosts [puppet] - https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: Mobrovac)
[10:17:39] Operations, Wikimedia-Mailing-lists: Provide a mean to mass discard/reject subscription requests on Wikimedia mailing lists - https://phabricator.wikimedia.org/T194669#4220519 (ArielGlenn) p:Triage>Normal
[10:18:55] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:434310|Bumping portals to master (T128546)]]
(duration: 01m 22s)
[10:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:20:16] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:434310|Bumping portals to master (T128546)]] (duration: 01m 20s)
[10:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:52] (PS2) Mobrovac: Proton: Apply the role to proton hosts [puppet] - https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748)
[10:24:56] (CR) Mobrovac: "PCC: https://puppet-compiler.wmflabs.org/compiler02/11256/" [puppet] - https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: Mobrovac)
[10:25:25] if anyone is at BCN, I'm airside near Breadway
[10:44:25] krenair@deployment-certcentral:~$ sudo service uwsgi-le-central sttaus
[10:44:25] uwsgi-le-central: unrecognized service
[10:44:25] krenair@deployment-certcentral:~$ sudo service uwsgi-le-central status
[10:44:25] ● uwsgi-le-central.service - uwsgi-le-central uwsgi app
[10:44:48] I don't know whether to blame this on upstart, systemd, or system V
[10:44:53] but something is messed up there
[10:48:06] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:06] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:17] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:26] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:47] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:56] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:48:56] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:58:39] oom
[11:00:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1100).
[11:00:07] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[11:00:16] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[11:00:37] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[11:00:38] nagios-nrpe-server was killed for oom and it's already restarted over there
[11:00:46] RECOVERY - DPKG on stat1005 is OK: All packages OK
[11:00:56] RECOVERY - Disk space on stat1005 is OK: DISK OK
[11:01:06] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[11:04:37] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:23:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318
[11:23:06] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318
[11:25:51] ACKNOWLEDGEMENT - Check systemd state on snapshot1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
arielglenn smartd fails because device is not smartd capable host is scheduled for replacement anyways, see T184616
[11:27:46] RECOVERY - MariaDB Slave Lag: s3 on db2092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[12:26:40] (PS1) Merlijn van Deen: labs/db: maintain-meta_p restructuring [puppet] - https://gerrit.wikimedia.org/r/434323
[12:27:31] (CR) jerkins-bot: [V: -1] labs/db: maintain-meta_p restructuring [puppet] - https://gerrit.wikimedia.org/r/434323 (owner: Merlijn van Deen)
[12:39:33] (PS5) Merlijn van Deen: Do not connect to SQL server for a dry run [puppet] - https://gerrit.wikimedia.org/r/432532
[12:39:35] (PS3) Merlijn van Deen: labs/db: create basic integration test for maintain-meta_p [puppet] - https://gerrit.wikimedia.org/r/432698
[12:39:37] (PS2) Merlijn van Deen: labs/db: maintain-meta_p restructuring [puppet] - https://gerrit.wikimedia.org/r/434323
[12:40:38] (CR) jerkins-bot: [V: -1] labs/db: maintain-meta_p restructuring [puppet] - https://gerrit.wikimedia.org/r/434323 (owner: Merlijn van Deen)
[12:40:57] apergos: thanks for checking!
[12:41:07] yw
[12:51:18] !log stop and reimage db1077
[12:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1300).
[13:00:04] bawolff: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] \o/
[13:01:08] (PS1) Jcrespo: mariadb: Allow reimage of db1077 to stretch [puppet] - https://gerrit.wikimedia.org/r/434327
[13:01:59] (CR) Jcrespo: [C: 2] mariadb: Allow reimage of db1077 to stretch [puppet] - https://gerrit.wikimedia.org/r/434327 (owner: Jcrespo)
[13:08:34] Umm, can I just deploy that myself?
[13:11:08] *crickets* going to take that as a yes
[13:11:28] (CR) Brian Wolff: [C: 2] Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[13:12:56] (Merged) jenkins-bot: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[13:13:12] (CR) jenkins-bot: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[13:14:37] bawolff: yes :)
[13:14:42] lol
[13:14:47] Woo
[13:22:07] (PS3) Merlijn van Deen: labs/db: maintain-meta_p restructuring [puppet] - https://gerrit.wikimedia.org/r/434323
[13:22:27] !log bawolff@tin Synchronized wmf-config/InitialiseSettings.php: Increase edit rate limits on commons (T194864) (duration: 01m 24s)
[13:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:34] T194864: Raise the rate limit for autopatrollers on Commons - https://phabricator.wikimedia.org/T194864
[13:27:05] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.125 second response time
[13:32:05] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1960 bytes in 0.114 second response time
[13:33:31] (PS1) Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437)
[13:33:41] \o/
[13:34:24] (CR) jerkins-bot: [V: -1] Implement kubernetes configuration observer [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: Vgutierrez)
[13:34:46] jenkins is always a party breaker
[13:35:13] (PS3) Jcrespo: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318 (owner: Marostegui)
[13:35:14] nah... just a missing dep
[13:35:59] (PS2) Vgutierrez: Implement kubernetes configuration observer [debs/pybal] - https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437)
[13:37:34] :)
[13:37:54] !log stop and reimage db2035
[13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:01] (CR) Elukey: Change to use VUT (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/430069 (owner: R4q3NWnUx2CEhVyr)
[14:03:11] (CR) Jcrespo: [C: 2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318 (owner: Marostegui)
[14:04:43] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318 (owner: Marostegui)
[14:08:26] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 with low weight (duration: 01m 20s)
[14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:31] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/434318 (owner: Marostegui)
[14:22:31] (PS1) Jcrespo: mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870)
[14:26:04] (CR) Marostegui: [C: 1]
mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870) (owner: Jcrespo)
[14:28:24] Operations, ops-eqiad, DBA, Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4220796 (Marostegui) Still looking fine ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : OK ``` If by tom...
[14:32:23] (PS2) Jcrespo: mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870)
[14:34:30] (CR) Jcrespo: [C: 2] mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870) (owner: Jcrespo)
[14:35:55] (Merged) jenkins-bot: mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870) (owner: Jcrespo)
[14:39:23] (CR) jenkins-bot: mariadb: Depool db1074 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/434339 (https://phabricator.wikimedia.org/T194870) (owner: Jcrespo)
[14:41:14] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.28 (duration: 03m 54s)
[14:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 and pool db1077 with 100% weight (duration: 01m 20s)
[14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:53] (PS1) Jcrespo: mariadb: Repool db1109 fully after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/434342
[14:46:42] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.29 (duration: 03m 29s)
[14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:27] (PS1) Jcrespo: mariadb: Reimage db1074 as stretch [puppet] -
https://gerrit.wikimedia.org/r/434343
[14:50:19] !log stop and reimage db1074
[14:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:45] Operations: Update prometheus-varnish-exporter on debian to 1.4 - https://phabricator.wikimedia.org/T195252#4220835 (Paladox)
[14:56:19] (CR) Jcrespo: [C: 2] mariadb: Reimage db1074 as stretch [puppet] - https://gerrit.wikimedia.org/r/434343 (owner: Jcrespo)
[14:57:29] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.30 (duration: 03m 19s)
[14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:33] (PS1) Jcrespo: Revert "mariadb: Depool db1074 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/434344
[15:00:12] (CR) Jcrespo: [C: 2] mariadb: Repool db1109 fully after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/434342 (owner: Jcrespo)
[15:01:35] (Merged) jenkins-bot: mariadb: Repool db1109 fully after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/434342 (owner: Jcrespo)
[15:01:50] (CR) jenkins-bot: mariadb: Repool db1109 fully after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/434342 (owner: Jcrespo)
[15:05:29] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool fully db1109 (duration: 01m 18s)
[15:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:06] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54679 MB (3% inode=99%)
[15:17:29] RECOVERY - Disk space on maps1001 is OK: DISK OK
[15:19:58] PROBLEM - MariaDB Slave Lag: s2 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1761.46 seconds
[15:20:18] ^that should go away in a second
[15:20:26] :)
[15:23:27] !log demon@tin Pruned MediaWiki: 1.32.0-wmf.2 [keeping static files] (duration: 01m 56s)
[15:23:29] RECOVERY - MariaDB Slave Lag: s2 on db1102 is OK: OK slave_sql_lag Replication lag: 40.48 seconds
[15:23:30]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:52] (PS2) Jcrespo: Revert "mariadb: Depool db1074 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/434344
[15:27:07] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1074 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/434344 (owner: Jcrespo)
[15:27:32] (CR) Vgutierrez: [C: 1] facter: refactor the net_driver fact [puppet] - https://gerrit.wikimedia.org/r/434032 (owner: Volans)
[15:28:37] (Merged) jenkins-bot: Revert "mariadb: Depool db1074 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/434344 (owner: Jcrespo)
[15:29:28] (CR) jenkins-bot: Revert "mariadb: Depool db1074 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/434344 (owner: Jcrespo)
[15:29:42] (PS2) Dzahn: dumps: replace base::service_unit with systemd::service [puppet] - https://gerrit.wikimedia.org/r/433685 (https://phabricator.wikimedia.org/T194724)
[15:29:45] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/11257/" [puppet] - https://gerrit.wikimedia.org/r/433685 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[15:30:18] (CR) Dzahn: [C: 2] dumps: replace base::service_unit with systemd::service [puppet] - https://gerrit.wikimedia.org/r/433685 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[15:32:13] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 with low load (duration: 01m 18s)
[15:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:19] (CR) Dzahn: [C: 2] "noop on dumpsdata1001/1002" [puppet] - https://gerrit.wikimedia.org/r/433685 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[15:53:34] (PS1) Elukey: druid: set the default druid.storage.storageDirectory even without cdh [puppet] - https://gerrit.wikimedia.org/r/434350 (https://phabricator.wikimedia.org/T193712)
[15:54:43]
(03PS2) 10Ottomata: Expose the revision-score event publically [puppet] - 10https://gerrit.wikimedia.org/r/433746 (owner: 10Ppchelko) [15:55:54] (03PS2) 10Elukey: druid: set the default druid.storage.storageDirectory even without cdh [puppet] - 10https://gerrit.wikimedia.org/r/434350 (https://phabricator.wikimedia.org/T193712) [15:57:12] (03CR) 10Ottomata: [C: 032] Expose the revision-score event publically [puppet] - 10https://gerrit.wikimedia.org/r/433746 (owner: 10Ppchelko) [16:00:15] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/11258/" [puppet] - 10https://gerrit.wikimedia.org/r/434350 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [16:03:41] (03CR) 10Elukey: [C: 032] druid: set the default druid.storage.storageDirectory even without cdh [puppet] - 10https://gerrit.wikimedia.org/r/434350 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [16:03:45] (03PS3) 10Elukey: druid: set the default druid.storage.storageDirectory even without cdh [puppet] - 10https://gerrit.wikimedia.org/r/434350 (https://phabricator.wikimedia.org/T193712) [16:04:52] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11259/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/433745 (owner: 10Ppchelko) [16:04:57] (03PS2) 10Ottomata: Provide EventBus URI to change-prop profile [puppet] - 10https://gerrit.wikimedia.org/r/433745 (owner: 10Ppchelko) [16:05:11] (03CR) 10Ottomata: [V: 032 C: 032] Provide EventBus URI to change-prop profile [puppet] - 10https://gerrit.wikimedia.org/r/433745 (owner: 10Ppchelko) [16:05:43] (03CR) 10Vgutierrez: "check inline comments" (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 (owner: 10Mark Bergsma) [16:18:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.100 second response time [16:23:45] RECOVERY - wikidata.org dispatch lag is higher than 
300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1963 bytes in 0.091 second response time [16:25:02] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4221028 (10Krenair) >>! In T194962#4217296, @BBlack wrote: > * The reference earlier to emulating the puppet fileserver protocol means something like this JSON... [16:27:57] 10Operations, 10ops-eqdfw, 10netops: eqdfw: Patch GTT cross-connect - https://phabricator.wikimedia.org/T194515#4221041 (10Papaul) cable ID 11389 [16:32:16] ACKNOWLEDGEMENT - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0: Ayounsi T194515 [16:35:18] 10Operations, 10ops-eqdfw, 10netops: eqdfw: Patch GTT cross-connect - https://phabricator.wikimedia.org/T194515#4221059 (10ayounsi) > Laser rx power : 0.0008 mW / -30.97 dBm Couldn't get light. We tried different ports/optics/patch and rolled the fiber for each combinations. Will... 
[16:47:58] !log fdans@tin Started deploy [analytics/refinery@16cb3be]: deploying new jar for upgraded ua parsing [16:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:38] !log fdans@tin Finished deploy [analytics/refinery@16cb3be]: deploying new jar for upgraded ua parsing (duration: 05m 40s) [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:46] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:55:15] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:55:35] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:56:46] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:57:25] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:57:55] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:00:04] gehel: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1700). [17:00:15] jouncebot: o/ [17:04:25] !log gehel@tin Started deploy [wdqs/wdqs@e01dd03]: new WDQS GUI version [17:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:52] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to stat1006 for Go Fish Digital - https://phabricator.wikimedia.org/T194287#4221085 (10mpopov) 05Open>03declined Cancelling this request as I will be uploading the data to them instead. 
[17:13:09] !log gehel@tin Finished deploy [wdqs/wdqs@e01dd03]: new WDQS GUI version (duration: 08m 44s) [17:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] SMalyshev: deployment completed, tests are green [17:40:30] (03PS5) 10Dzahn: ganeti: add interactive script to create VMs [puppet] - 10https://gerrit.wikimedia.org/r/433296 [17:41:29] (03PS1) 10Ottomata: no-op: Set inter_broker_ssl_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/434358 (https://phabricator.wikimedia.org/T193778) [17:42:07] (03CR) 10Dzahn: [C: 032] ganeti: add interactive script to create VMs [puppet] - 10https://gerrit.wikimedia.org/r/433296 (owner: 10Dzahn) [17:42:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.109 second response time [17:43:46] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11260/" [puppet] - 10https://gerrit.wikimedia.org/r/434358 (https://phabricator.wikimedia.org/T193778) (owner: 10Ottomata) [17:43:48] (03CR) 10Ottomata: [C: 032] no-op: Set inter_broker_ssl_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/434358 (https://phabricator.wikimedia.org/T193778) (owner: 10Ottomata) [17:43:50] (03PS2) 10Ottomata: no-op: Set inter_broker_ssl_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/434358 (https://phabricator.wikimedia.org/T193778) [17:43:52] (03CR) 10Ottomata: [V: 032 C: 032] no-op: Set inter_broker_ssl_enabled false [puppet] - 10https://gerrit.wikimedia.org/r/434358 (https://phabricator.wikimedia.org/T193778) (owner: 10Ottomata) [17:47:35] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.103 second response time [17:47:41] (03Draft1) 10Zoranzoki21: Enable "File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 [17:51:29] (03PS2) 10Zoranzoki21: Enable 
"File mover" flag on zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434359 (https://phabricator.wikimedia.org/T195247) [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1800). [18:00:06] No GERRIT patches in the queue for this window AFAICS. [18:00:27] (03PS1) 10Ottomata: Enable Kafka SSL listener for main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/434361 (https://phabricator.wikimedia.org/T193778) [18:00:29] (03PS1) 10Ottomata: Enable SSL inter.broker communication for Kafka main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/434362 (https://phabricator.wikimedia.org/T193778) [18:01:13] (03PS1) 10Elukey: profile::druid::common: add the use_cdh_hadoop_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/434363 (https://phabricator.wikimedia.org/T193712) [18:03:07] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11261/" [puppet] - 10https://gerrit.wikimedia.org/r/434363 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [18:03:24] (03CR) 10Elukey: [C: 032] profile::druid::common: add the use_cdh_hadoop_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/434363 (https://phabricator.wikimedia.org/T193712) (owner: 10Elukey) [18:16:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.099 second response time [18:24:35] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54254 MB (3% inode=99%) [18:32:06] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.120 second response time [18:39:55] PROBLEM - wikidata.org dispatch lag 
is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.108 second response time [18:55:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.101 second response time [19:01:01] (03PS1) 10MaxSem: Add a comment about GlobalPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434367 [19:01:03] (03PS1) 10MaxSem: Enable GlobalPreferences on non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434368 (https://phabricator.wikimedia.org/T189806) [19:06:29] elukey: If you're around and happen to be able to roll out an mtail change, would appreciate review of https://gerrit.wikimedia.org/r/#/c/432712/ [19:07:16] (03PS8) 10Krinkle: mtail: Add xcachestatus to varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) [19:07:25] (03CR) 10Krinkle: "(Rebase)" [puppet] - 10https://gerrit.wikimedia.org/r/432712 (https://phabricator.wikimedia.org/T190978) (owner: 10Krinkle) [19:11:28] (03PS1) 10Ottomata: Use separate consumer group names for EventLogging log files [puppet] - 10https://gerrit.wikimedia.org/r/434369 [19:11:57] (03CR) 10jerkins-bot: [V: 04-1] Use separate consumer group names for EventLogging log files [puppet] - 10https://gerrit.wikimedia.org/r/434369 (owner: 10Ottomata) [19:13:36] (03PS2) 10Ottomata: Use separate consumer group names for EventLogging log files [puppet] - 10https://gerrit.wikimedia.org/r/434369 [19:14:05] (03CR) 10Ottomata: [C: 032] Use separate consumer group names for EventLogging log files [puppet] - 10https://gerrit.wikimedia.org/r/434369 (owner: 10Ottomata) [19:16:51] (03CR) 10Chad: "You're correct, I wasn't thinking (so yes, obviously untested). Suggestions?" 
[puppet] - 10https://gerrit.wikimedia.org/r/429447 (owner: 10Chad) [19:18:55] RECOVERY - Disk space on maps1001 is OK: DISK OK [19:19:52] !log clearing cassandra snaphosts on maps* nodes to regain some space - T194966 [19:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:57] T194966: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966 [19:25:06] (03PS1) 10Ottomata: Debian release 0.8.0-1 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/434372 (https://phabricator.wikimedia.org/T192529) [19:25:20] (03CR) 10Ottomata: [V: 032 C: 032] Debian release 0.8.0-1 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/434372 (https://phabricator.wikimedia.org/T192529) (owner: 10Ottomata) [19:38:18] (03PS2) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) [19:38:51] (03CR) 10jerkins-bot: [V: 04-1] ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [19:40:54] (03PS3) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) [19:46:45] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:50:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.085 second response time [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T2000). [20:01:25] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1959 bytes in 0.109 second response time [20:02:31] no parsoid deploy today [20:14:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.099 second response time [20:18:13] (03PS2) 10Herron: add gh-mail.wikimedia.org (greenhouse.io) spf/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) [20:18:28] (03CR) 10jerkins-bot: [V: 04-1] add gh-mail.wikimedia.org (greenhouse.io) spf/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) (owner: 10Herron) [20:19:41] (03PS3) 10Herron: add gh-mail.wikimedia.org (greenhouse.io) spf/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) [20:20:33] (03PS4) 10Herron: add gh-mail.wikimedia.org (greenhouse.io) spf/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) [20:21:04] (03CR) 10Herron: [C: 032] add gh-mail.wikimedia.org (greenhouse.io) spf/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) (owner: 10Herron) [20:25:27] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434418 [20:25:29] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434418 (owner: 1020after4) [20:27:14] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434418 (owner: 1020after4) [20:31:20] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.4 refs T191050 
[20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:26] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050 [20:32:39] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.32.0-wmf.4 refs T191050 (duration: 01m 18s) [20:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:00] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.4 refs T191050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434418 (owner: 1020after4) [20:47:22] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4221400 (10gh87) From Safari 11.1 (13605.1.33.1.4) at mac OS X High Sierra 10.13.4: Below happens when going from one article to another:... [21:00:04] bawolff and Reedy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T2100). 
[21:01:44] * bawolff not deploying anything today [21:03:23] (03PS2) 10Dzahn: xenon: replace base::service_unit, rm upstart template [puppet] - 10https://gerrit.wikimedia.org/r/433684 (https://phabricator.wikimedia.org/T194724) [21:18:08] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/11262/" [puppet] - 10https://gerrit.wikimedia.org/r/433684 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:22:33] (03CR) 10Dzahn: [C: 032] "noop on mwlog1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/433684 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:22:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1977 bytes in 0.238 second response time [21:24:40] !log granted User:CN=kafka_fundraising_client read permissions for group fundraising* on kafka-jumbo (for kafkatee webrequest consumption: kafka acls --add --allow-principal User:CN=kafka_fundraising_client --consumer --topic '*' --group 'fundraising*' [21:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:16] krinkle/marlier: hi! the new webperf machines have IPv6 but the old ones don't have it yet. i would like to add that there as well. sounds ok? [21:27:32] so the "mapped IPv6" address and the puppet snippet [21:28:04] unless that causes trouble for existing services of course [21:29:22] mutante: Would this change the IP from 4 to 6, or add it in addition? Are there known issues with Kafka, Statsd or Graphite with IPv6? [21:30:35] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.116 second response time [21:33:03] Krinkle: it would add it in addition. i checked if those are already using it and kafka brokers does.. 
but graphite does not [21:33:35] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4221469 (10herron) SPF and MX records for greenhouse have been configured using the subdomain `gh-mail.wikimedia.org` and verified through the greenhouse web admin interface.... [21:34:51] mutante: Okay, I suppose that should just work fine and it uses the right address protocol automatically. If not, we'll find out soon enough. [21:35:17] mutante: The webperfx1 machines consume from kafka and talk to statsd (webperf/navtiming) and graphite (webperf/coal). [21:35:20] Krinkle: i see that graphite upstreadm once had a "add IPv6 support" bug (for pure IPv6 environments which this wouldnt be) and that has been closed resolved [21:35:32] The navtiming one does not resume with offsets, which means downtime creates permanent gaps in data. [21:35:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.102 second response time [21:35:49] So we should do this in a way that reduces downtime as much as possible, and ready to revert if needed. [21:36:27] it would normally try IPv6 first and if that fails fallback to v4, ack [21:36:35] ipv6 should be fine with kafka [21:36:43] cool, thanks ottomata [21:37:21] i can add the AAAA records in DNS with 5 min TTL and revert would be quick if needed [21:37:38] mutante: Thanks [21:37:49] mutante: we're currently working on adding etcd support so that we can switch over to codfw [21:37:52] but that's not live yet. [21:38:06] and also, now that I think about it, we should change that so that it doesn't re-use mediawiki's etcd master-dc key [21:38:53] so that we can switch over on our own account. Maybe there's a way that we can orchestrate with puppet somewhere that in case of an en-mass switch over, we'd follow mediawiki and graphite etc. but still allow separate switching. 
[21:39:01] cool, that sounds good [21:40:28] yes, so there is the $mw_primary variable to follow [21:40:56] $mw_primary = $app_routes['mediawiki'] [21:41:09] mutante: Do we use that to also change graphite-in from eqiad to codfw? [21:41:14] and statsd? [21:41:49] trying to find that out.. so mw_primary gets it from $app_routes and then: [21:41:56] $app_routes = hiera('discovery::app_routes') [21:43:19] i don't think we do yet but this new system is what you would want to use in the future: [21:43:28] # Application routes [21:43:28] # MediaWiki should be reached to the master datacenter, [21:43:28] # while other services should be polled locally [21:43:28] discovery::app_routes: [21:43:28] mediawiki: 'eqiad' [21:43:30] aqs: 'eqiad' [21:43:35] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.100 second response time [21:48:04] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4221485 (10Krinkle) @Imarlier I was just thinking about whether the etcd code is live or not for webperf/navtiming. The puppet invocation hasn... [21:48:35] mutante: Cool. I guess somethings maybe don't automatically get it from Puppet with mw_primary, but could still switch at the same time using the switchover script we have [21:48:49] So either way, worst case scenario we can just add navtiming's etcd key to the switchover script. [21:49:11] Anyway, I'll be here for about 1 hour before logging off, so I'm okay with doing this now [21:49:15] Otherwise tomorrow's good too [21:49:58] oh, ok, let's do it now. 
amending :) [21:53:22] (03PS3) 10Dzahn: webperf: add IPv6 mapped addresses to webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/433299 [21:53:28] yea, so there is an issue with the wmf-style-guide lint check but we ignore that for right now [21:53:41] currently you cant do it on node level or in role level [21:53:52] but that will be fixed separately and will be no-op later [21:53:59] (03CR) 10jerkins-bot: [V: 04-1] webperf: add IPv6 mapped addresses to webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/433299 (owner: 10Dzahn) [21:54:36] (03PS4) 10Dzahn: webperf: add IPv6 mapped addresses to webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/433299 [21:55:03] the interesting part for now is just that we are making all webperf* machines the same [21:55:20] (03PS5) 10Dzahn: webperf: add IPv6 mapped addresses to webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/433299 [21:58:42] (03PS1) 10Dzahn: add IPv6 records for webperf1001/webperf2001 [dns] - 10https://gerrit.wikimedia.org/r/434423 [21:58:54] (03CR) 10jerkins-bot: [V: 04-1] add IPv6 records for webperf1001/webperf2001 [dns] - 10https://gerrit.wikimedia.org/r/434423 (owner: 10Dzahn) [21:59:34] (03PS1) 10Chelsyx: Blacklisting new iOS eventlogging schemas on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) [21:59:45] !log adding email to User:Quuxplusone - T194929 [21:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:49] T194929: User has saved account login but he forgot his password - https://phabricator.wikimedia.org/T194929 [21:59:59] (03CR) 10jerkins-bot: [V: 04-1] Blacklisting new iOS eventlogging schemas on MySQL [puppet] - 10https://gerrit.wikimedia.org/r/434424 (https://phabricator.wikimedia.org/T192819) (owner: 10Chelsyx) [22:00:06] heh, DNS lint: 21:58:53 # error: Name 'webperf1001.eqiad.wmnet.': All TTLs for A and/or AAAA records at the same name should agree (using 3600) 
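(Editorial note) The lint error above says that A and AAAA records at the same name must share one TTL. A minimal sketch of such a check follows. It is not the actual DNS lint tool: the function name `ttl_mismatches` and the record tuple layout are hypothetical, and the sample values assume the existing A record had a TTL of 3600 and the new AAAA record 300 (5 min), as the discussion suggests.

```python
from collections import defaultdict


def ttl_mismatches(records):
    """Return {name: sorted TTLs} for names whose A/AAAA TTLs disagree.

    `records` is an iterable of (name, rtype, ttl) tuples; this layout is
    an assumption made for illustration, not a real zone-file parser.
    """
    ttls = defaultdict(set)
    for name, rtype, ttl in records:
        if rtype in ("A", "AAAA"):
            ttls[name].add(ttl)
    return {name: sorted(seen) for name, seen in ttls.items() if len(seen) > 1}


# Assumed state before the fix: A at 3600, new AAAA at 300.
before = [
    ("webperf1001.eqiad.wmnet.", "A", 3600),
    ("webperf1001.eqiad.wmnet.", "AAAA", 300),
]
# After lowering the A record to 5 min as well.
after = [
    ("webperf1001.eqiad.wmnet.", "A", 300),
    ("webperf1001.eqiad.wmnet.", "AAAA", 300),
]

print(ttl_mismatches(before))
print(ttl_mismatches(after))
```

Setting both records to the same short TTL keeps the lint check quiet and still lets a revert propagate quickly.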
[22:00:16] well ok, then i have to set A record to 5 min as well [22:02:11] (03PS2) 10Dzahn: add IPv6 records for webperf1001/webperf2001 [dns] - 10https://gerrit.wikimedia.org/r/434423 [22:03:47] (03CR) 10Dzahn: [C: 032] webperf: add IPv6 mapped addresses to webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/433299 (owner: 10Dzahn) [22:04:10] Krinkle: so the next puppet run will add a new IP on the interface [22:04:21] then i will check against the DNS change one last time and merge that too [22:06:46] puppet run stops when that happens.. then i can run it again. done [22:07:36] (03PS3) 10Dzahn: add IPv6 records for webperf1001/webperf2001 [dns] - 10https://gerrit.wikimedia.org/r/434423 [22:07:48] the mapped IPs we got match what i am adding in DNS . going ahead [22:08:13] (03CR) 10Dzahn: [C: 032] add IPv6 records for webperf1001/webperf2001 [dns] - 10https://gerrit.wikimedia.org/r/434423 (owner: 10Dzahn) [22:09:41] [webperf2002:~] $ host webperf1001.eqiad.wmnet [22:09:42] webperf1001.eqiad.wmnet has address 10.64.0.215 [22:09:42] webperf1001.eqiad.wmnet has IPv6 address 2620:0:861:101:10:64:0:215 [22:09:48] [webperf2002:~] $ ping6 webperf1001.eqiad.wmnet [22:09:48] PING webperf1001.eqiad.wmnet(webperf1001.eqiad.wmnet (2620:0:861:101:10:64:0:215)) 56 data bytes [22:09:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1963 bytes in 0.147 second response time [22:11:05] Krinkle: from 1001 to 2001, from 2002 to 1001 .. i can use ping6 now. 
lgtm [22:13:45] !log enabled IPv6 on webperf machines [22:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:38] apache2 listens on tcp6 :::80 [22:16:54] didnt restart any services so far, not sure we have to [22:19:36] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54713 MB (3% inode=99%) [22:37:49] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4221531 (10Dzahn) added mapped IPv6 addresses on the interface for webperf1001/2001: https://gerrit.wikimedia.org/r/#/c/433299/ after applyi... [22:38:31] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4221532 (10Dzahn) If there are any unexpected issue then reverting is reverting the 2 patches above. I used a TTL of 5 min in DNS just in case... [22:43:07] Krinkle: yea, so.. there was no server reboot or anything but i would still say done [22:47:58] mutante: Hm.. I didn't know this was possible without reboot. So what does that typically mean for processes on linux if a new network ability comes up. Presumably that doesn't apply to existing connections. [22:48:30] I can restart the services, I assume that would make it do the same it would do after a reboot, right? [22:50:28] !log webperf1001: restart navtiming service, to test with new ipv6 capabilities [22:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:54] Krinkle: so in this case .. 
we already had an inet6 interface and an IP on it [22:51:11] what we did not have was the "nice" mapped address that refers to the v4 address [22:51:17] and the DNS records didnt exist [22:51:36] but it wasnt new for Linux that an inet6 interface exists [22:51:47] https://graphite.wikimedia.org/render?target=frontend.navtiming2.responseStart.overall.sample_rate&from=-2h&width=800&height=400 [22:52:09] Looks like there was a 10min gap in data from 23:05 to 23:15 [22:52:10] 30min ago [22:52:26] for example on analytics1001 you dont have the puppet code or the DNS records and: [22:52:28] The service uptime was 3 weeks before I restared it [22:52:29] inet6 2620:0:861:106:f21f:afff:fee8:af06 [22:52:48] which is derived from the MAC [22:53:06] Rather unexpected that adding the ipv6 would appear to have made the connection fail in a silent way for 10min? [22:53:30] it doesn't fully add up.. 30 min ago [22:53:35] RECOVERY - Disk space on maps1001 is OK: DISK OK [22:54:14] mutante: 23:05 is exactly when the puppet patch went out (40min ago now) [22:54:20] 50min even [22:54:31] 22:05 in UTC [22:54:32] sorry [22:55:25] hmm. and then for 10 min? well ok, 5 minutes TTL and how long it took me to merge the DNS change and authdns-update ran [22:55:59] Aye, but is it normal that during that time, the host cannot send any packets on existing connections? [22:56:13] no, i didn't expect that at least [22:56:25] Right, okay, just checking :) [22:56:42] I don't see anything in logs. [22:56:51] well, i hope this wasnt too bad and it's ok now [22:56:55] It looks like at least some of the data got clogged somewhere and then got out in a burst after 10min [22:57:18] even though the application doesn't do that, so must be in the OS or hardware somewhere [22:57:36] The medians weren't too skewed and the gap is relatively small. 
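(Editorial note) The two kinds of IPv6 address mentioned above can be reproduced with a short sketch. The "mapped" address reuses the decimal IPv4 octets as the last four groups of the IPv6 address, and the autoconfigured one follows the modified EUI-64 scheme based on the MAC. The function names are hypothetical, the /64 prefixes are read off the addresses quoted in the log, and the MAC `f0:1f:af:e8:af:06` is not in the log: it is back-derived from the quoted address for illustration only.

```python
import ipaddress


def mapped_ipv6(ipv4: str, prefix: str) -> ipaddress.IPv6Address:
    """Embed the decimal IPv4 octets as the low four 16-bit groups."""
    net = ipaddress.IPv6Network(prefix)
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")  # validates input
    iid = 0
    for octet in octets:
        # The decimal digits are reused verbatim as hex digits: "215" -> 0x215.
        iid = (iid << 16) | int(octet, 16)
    return ipaddress.IPv6Address(int(net.network_address) | iid)


def eui64_ipv6(mac: str, prefix: str) -> ipaddress.IPv6Address:
    """Modified EUI-64: flip the universal/local bit, insert ff:fe."""
    net = ipaddress.IPv6Network(prefix)
    raw = bytearray(int(part, 16) for part in mac.split(":"))
    raw[0] ^= 0x02
    iid = bytes(raw[:3]) + b"\xff\xfe" + bytes(raw[3:])
    return ipaddress.IPv6Address(
        int(net.network_address) | int.from_bytes(iid, "big")
    )


# Address and prefix as quoted in the log for webperf1001.
print(mapped_ipv6("10.64.0.215", "2620:0:861:101::/64"))
# MAC below is an assumption, back-derived from the quoted analytics1001 address.
print(eui64_ipv6("f0:1f:af:e8:af:06", "2620:0:861:106::/64"))
```

The mapped form makes the IPv6 address readable at a glance from the IPv4 one, whereas the EUI-64 form changes whenever the network card does.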
[22:57:41] at least we now won't run into some obscure issue in the future that is because some webperf* machines have this and some don't [22:58:15] But good to keep in mind that adding IPv6 on a live host causes data loss, as well as buffering/delay of several minutes, which can work in unexpected ways on metric systems that rely on timestamps at least with minute precision (e.g. statsd/graphite) [22:58:41] yes, ack! [22:58:51] e.g. data for 22:05-22:07 is lost and data for 22:08-22:12 was replayed all at once around 22:15 [22:59:03] somehow :/ [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T2300). [23:00:04] MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:12] hmm, yea.. this order seemed safer than first adding the DNS record before the new IP is actually bound to the interface [23:00:26] but you really want it in the same moment [23:00:50] I'll swat myself [23:01:45] There aren't really any (important) incoming connections to this machine (aside from maybe puppet and such). So I'm curious if maybe it'd be safe to add ahead of time. Or would that also affect responses on outgoing connections? Afaik those don't involve DNS. [23:03:02] Anyway, in the future maintenance issues, we'll switch over to codfw (hopefully navtiming is multi-dc by then). [23:05:49] yes, the puppet connection itself gets interrupted in that moment and you have to stop the run and restart it. i think you are right about outgoing and adding it first. 
yes, switching over is much nicer [23:06:14] (03PS2) 10MaxSem: Add a comment about GlobalPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434367 [23:06:19] (03CR) 10MaxSem: [C: 032] Add a comment about GlobalPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434367 (owner: 10MaxSem) [23:07:56] (03Merged) 10jenkins-bot: Add a comment about GlobalPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434367 (owner: 10MaxSem) [23:08:41] (03PS2) 10MaxSem: Enable GlobalPreferences on non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434368 (https://phabricator.wikimedia.org/T189806) [23:08:57] (03CR) 10MaxSem: [C: 032] Enable GlobalPreferences on non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434368 (https://phabricator.wikimedia.org/T189806) (owner: 10MaxSem) [23:09:41] (03CR) 10jenkins-bot: Add a comment about GlobalPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434367 (owner: 10MaxSem) [23:10:27] (03Merged) 10jenkins-bot: Enable GlobalPreferences on non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434368 (https://phabricator.wikimedia.org/T189806) (owner: 10MaxSem) [23:11:17] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4221555 (10Krinkle) > 22:50 Krinkle: webperf1001: restart navtiming service, to test with new ipv6 capabilities The service remained fine aft... 
[23:15:42] (03CR) 10jenkins-bot: Enable GlobalPreferences on non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434368 (https://phabricator.wikimedia.org/T189806) (owner: 10MaxSem) [23:17:19] (03PS1) 10Dzahn: zuul: replase base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) [23:18:43] (03PS2) 10Dzahn: zuul: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) [23:19:47] (03PS1) 10Dzahn: tcpircbot: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434428 (https://phabricator.wikimedia.org/T194724) [23:20:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.102 second response time [23:29:26] (03PS1) 10Dzahn: mariadb: update TODO about base:service_unit [puppet] - 10https://gerrit.wikimedia.org/r/434429 (https://phabricator.wikimedia.org/T194724) [23:29:54] (03PS2) 10Dzahn: mariadb: update TODO about base:service_unit [puppet] - 10https://gerrit.wikimedia.org/r/434429 (https://phabricator.wikimedia.org/T194724) [23:31:22] (03CR) 10Dzahn: [C: 032] "just a comment" [puppet] - 10https://gerrit.wikimedia.org/r/434429 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [23:33:24] (03PS3) 10Dzahn: otrs: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433491 (https://phabricator.wikimedia.org/T194724) [23:36:35] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1978 bytes in 0.101 second response time [23:47:13] (03PS1) 10Dzahn: snapshot/dumps-monitor: replace base:service_unit [puppet] - 10https://gerrit.wikimedia.org/r/434431 (https://phabricator.wikimedia.org/T194724) [23:54:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on 
www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.225 second response time