[00:01:33] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[00:01:43] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[00:03:33] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[00:03:43] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[00:08:45] (PS1) BBlack: Remove ipsec from kafka jumbo nodes [puppet] - https://gerrit.wikimedia.org/r/447004 (https://phabricator.wikimedia.org/T182993)
[00:10:44] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[00:11:03] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[00:11:53] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[00:12:03] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[00:12:58] !log disabled puppet temporary on cp* and kafka-jumbo* for ipsec unconfiguration
[00:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:16] (CR) BBlack: [C: 2] Remove ipsec from kafka jumbo nodes [puppet] - https://gerrit.wikimedia.org/r/447004 (https://phabricator.wikimedia.org/T182993) (owner: BBlack)
[00:26:28] Operations, Analytics-Cluster, Analytics-Kanban, Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993 (BBlack) Open>Resolved
[00:27:22] * Krinkle staging on deploy1001 and mwdebug1002
[00:31:25] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.13/includes/filerepo/: I58706b5610 and I40f6ad2a3d - T200026 (duration: 00m 56s)
[00:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:29] T200026: RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU - https://phabricator.wikimedia.org/T200026
[00:40:12] Operations, Discovery-Search, Wikimedia-Logstash, Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (EBernhardson)
[00:56:55] Operations, Discovery-Search, Wikimedia-Logstash, Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (EBernhardson) This is (unsurprisingly) still going on. Over the last 30 days we have ~10k unique fields. A quick pa...
[01:28:42] Rolling out another fix for log-errors: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralAuth/+/447012/
[01:38:47] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.13/extensions/CentralAuth/includes/CentralAuthUser.php: T170971 (duration: 00m 55s)
[01:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:52] T170971: iconv(): Detected an illegal character in CentralAuthUser - https://phabricator.wikimedia.org/T170971
[02:05:51] Operations, Performance-Team, vm-requests: Increase webperf1002/webperf2002 space from 50GB to 500 GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (Krinkle) @fgiunchedi Cool, that sounds good. Regarding storage use and scaling, I assume it gets distributed among different backend servers as...
[03:30:02] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 860.18 seconds
[03:41:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.53 seconds
[04:55:26] Operations, ops-codfw, DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (Marostegui)
[04:55:39] Operations, ops-codfw, DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (Marostegui) p:Triage>Normal
[04:55:53] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,9 instance=db2061:9100 job=node site=codfw Marostegui T200059 - The acknowledgement expires at: 2018-07-31 04:55:34. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops
[04:56:43] (CR) Krinkle: dbtree: move dbtree outside of mwmaint hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[05:02:59] (CR) Krinkle: Add wikimania.wikimedia.org to apache ServerAlias (1 comment) [puppet] - https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: Reedy)
[05:03:08] (CR) Krinkle: "+Infinity." [puppet] - https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: Reedy)
[05:08:30] !log Start to remove some unused files on db1067 - T200039
[05:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:34] T200039: db1067 /srv usage is at 82% - https://phabricator.wikimedia.org/T200039
[05:35:13] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/447020 (https://phabricator.wikimedia.org/T199368)
[05:41:31] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/447020 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[05:42:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/447020 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[05:42:54] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/447020 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[05:44:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 for alter table (duration: 00m 56s)
[05:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:05] (PS1) Marostegui: site.pp: Specify db1120 future location [puppet] - https://gerrit.wikimedia.org/r/447021 (https://phabricator.wikimedia.org/T196376)
[05:48:59] (CR) Marostegui: [C: 2] site.pp: Specify db1120 future location [puppet] - https://gerrit.wikimedia.org/r/447021 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[05:49:30] !log Deploy schema change on db1076 T144010 T51190 T199368
[05:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:36] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[05:49:36] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[05:49:37] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[05:51:35] If I want to change site logo, should I replace the original logo in repo or add it?
[05:52:31] (CR) Elukey: "\o/" [puppet] - https://gerrit.wikimedia.org/r/447004 (https://phabricator.wikimedia.org/T182993) (owner: BBlack)
[05:55:16] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/447022
[06:04:33] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/447022 (owner: Marostegui)
[06:05:50] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/447022 (owner: Marostegui)
[06:07:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 after alter table (duration: 00m 54s)
[06:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:07] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/447022 (owner: Marostegui)
[06:08:16] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/447026 (https://phabricator.wikimedia.org/T199368)
[06:10:28] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/447026 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[06:11:44] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/447026 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[06:12:50] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/447026 (https://phabricator.wikimedia.org/T199368) (owner: Marostegui)
[06:12:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1074 for alter table (duration: 00m 53s)
[06:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:08] !log Deploy schema change on db1074 with replication, this will generate lag on labsdb:s2 T144010 T51190 T199368
[06:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:13] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[06:13:14] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[06:13:14] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[06:31:43] (PS1) ꘟ耀ę™Øꛦ: Change zhwikiquote logo [mediawiki-config] - https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863)
[06:32:19] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/447031
[06:58:03] (PS7) Jcrespo: mariadb: Repool es1019 fully after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/446755 (https://phabricator.wikimedia.org/T197073)
[06:59:20] (CR) Jcrespo: [C: 2] mariadb: Repool es1019 fully after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/446755 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:01:00] (Merged) jenkins-bot: mariadb: Repool es1019 fully after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/446755 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:01:00] (CR) jenkins-bot: mariadb: Repool es1019 fully after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/446755 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:03:08] (PS2) ꘟ耀ę™Øꛦ: Change zhwikiquote logo [mediawiki-config] - https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863)
[07:06:29] (PS2) Jcrespo: mariadb: Repool db1099 fully after warmup [mediawiki-config] - https://gerrit.wikimedia.org/r/446902 (https://phabricator.wikimedia.org/T197073)
[07:09:58] (CR) Jcrespo: [C: 2] mariadb: Repool db1099 fully after warmup [mediawiki-config] - https://gerrit.wikimedia.org/r/446902 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:11:04] (Merged) jenkins-bot: mariadb: Repool db1099 fully after warmup [mediawiki-config] - https://gerrit.wikimedia.org/r/446902 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:11:20] (CR) jenkins-bot: mariadb: Repool db1099 fully after warmup [mediawiki-config] - https://gerrit.wikimedia.org/r/446902 (https://phabricator.wikimedia.org/T197073) (owner: Jcrespo)
[07:11:42] !log Stop replication in sync on db1074 and db1125:3312
[07:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099, es1019 fully (duration: 00m 54s)
[07:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:23] RECOVERY - swift-account-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[07:29:43] RECOVERY - swift-container-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[07:30:03] RECOVERY - swift-object-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[07:30:12] RECOVERY - swift-container-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[07:30:13] RECOVERY - swift-object-auditor on ms-be1040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[07:32:03] that's me ^ expected
[07:39:23] Operations, Performance-Team, vm-requests: Increase webperf1002/webperf2002 space from 50GB to 500 GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (fgiunchedi) >>! In T199853#4440040, @Krinkle wrote: > @fgiunchedi Cool, that sounds good. > > Regarding storage use and scaling, I assume it get...
[07:43:22] !log stopping db1095 mariadb and cloning it to db1102
[07:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:01] !log Stop replication on s8 codfw master - T200061
[07:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:05] T200061: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061
[07:48:57] (PS2) Jcrespo: transfer.py: Make checksum optional [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/446871 (https://phabricator.wikimedia.org/T156462)
[07:49:38] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/447031
[07:55:11] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) Open>Resolved The reported negative disk space issue has been repaired, we have alerting on the condition itself if it happens again. Also...
[07:59:09] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/447031 (owner: Marostegui)
[08:00:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/447031 (owner: Marostegui)
[08:00:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/447031 (owner: Marostegui)
[08:01:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1074 after alter table (duration: 00m 55s)
[08:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:03] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 48.09, 37.30, 22.53
[08:17:38] (PS1) Arturo Borrero Gonzalez: cloudvps: ignore stderr in labs-ip-alias-dump.py [puppet] - https://gerrit.wikimedia.org/r/447042
[08:18:46] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: ignore stderr in labs-ip-alias-dump.py [puppet] - https://gerrit.wikimedia.org/r/447042 (owner: Arturo Borrero Gonzalez)
[08:31:22] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 6.64, 11.66, 23.68
[08:41:25] !log reset email for User:Nomden
[08:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:11] !log stopping populateChangeTagDef.php (T200061)
[08:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:15] T200061: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061
[08:50:38] !log Stop replication on s7 codfw master to check arwiki.change_tag - T200061
[08:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:27] (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1: use cloudcontrol1004 as primary control server [puppet] - https://gerrit.wikimedia.org/r/447046 (https://phabricator.wikimedia.org/T200068)
[09:15:57] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: eqiad1: use cloudcontrol1004 as primary control server [puppet] - https://gerrit.wikimedia.org/r/447046 (https://phabricator.wikimedia.org/T200068) (owner: Arturo Borrero Gonzalez)
[09:20:52] PROBLEM - keystone public endoint port 5000 on labcontrol1003 is CRITICAL: connect to address 208.80.154.23 and port 5000: Connection refused
[09:20:53] PROBLEM - keystone admin endpoint port 35357 on labcontrol1003 is CRITICAL: connect to address 208.80.154.23 and port 35357: Connection refused
[09:21:18] arturo: ^^ that's expected?
[09:21:22] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:21:52] PROBLEM - Check systemd state on labcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:24:09] mmm
[09:28:41] (PS1) Arturo Borrero Gonzalez: Revert "cloudvps: eqiad1: use cloudcontrol1004 as primary control server" [puppet] - https://gerrit.wikimedia.org/r/447048 (https://phabricator.wikimedia.org/T200068)
[09:29:20] (CR) Arturo Borrero Gonzalez: [C: 2] Revert "cloudvps: eqiad1: use cloudcontrol1004 as primary control server" [puppet] - https://gerrit.wikimedia.org/r/447048 (https://phabricator.wikimedia.org/T200068) (owner: Arturo Borrero Gonzalez)
[09:30:44] RECOVERY - Check systemd state on labcontrol1003 is OK: OK - running: The system is fully operational
[09:31:14] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational
[09:32:03] PROBLEM - Check systemd state on db1073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:35:39] moritzm: ferm is failing?
[09:36:26] it is not you
[09:36:32] it is cloud people
[09:36:51] cloudcontrol1003.wikimedia.org < arturo this is affecting production services
[09:37:04] please revert
[09:37:28] He reverted already no? As per the above message at 11:29
[09:37:30] may cause an outage on your own services
[09:37:39] it's already reverted
[09:37:46] then I will run puppet and see if it clears
[09:38:18] it did, after I ran puppet, thanks
[09:38:39] let's identify what failed and why
[09:38:56] because this is an under-development system and shouldn't affect anything in production at all
[09:39:00] you can check (among others) puppet and journalctl on db1073
[09:39:13] RECOVERY - Check systemd state on db1073 is OK: OK - running: The system is fully operational
[09:39:18] technically m5 is production, but part of your infra (db)
[09:39:31] so it gets some code from cloud-related network constants
[09:39:42] nothing from cloudXXXX is yet touching m5 AFAIK
[09:40:21] ok, then no outage, but some config breakage :-)
[09:40:51] I see now the error in db1073
[09:43:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:46:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:46:24] !log Stop replication on db2063 to fix db2095 - T200061
[09:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:28] T200061: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061
[09:49:57] (CR) Jcrespo: [C: 2] transfer.py: Make checksum optional [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/446871 (https://phabricator.wikimedia.org/T156462) (owner: Jcrespo)
[09:50:02] (PS3) Jcrespo: transfer.py: Make checksum optional [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/446871 (https://phabricator.wikimedia.org/T156462)
[09:50:06] (CR) Jcrespo: [V: 2 C: 2] transfer.py: Make checksum optional [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/446871 (https://phabricator.wikimedia.org/T156462) (owner: Jcrespo)
[10:01:04] (PS1) Marostegui: db-eqiad.php: Depool db1090:3312 and db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447050 (https://phabricator.wikimedia.org/T200061)
[10:02:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090:3312 and db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447050 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:04:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3312 and db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447050 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:06:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 and db1103:3312 (duration: 00m 55s)
[10:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:33] (CR) jenkins-bot: db-eqiad.php: Depool db1090:3312 and db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447050 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:14:26] !log Stop replication in sync on db1090:3312 and db1103:3312 - T200061
[10:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:30] T200061: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061
[10:16:02] (PS14) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[10:16:29] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[10:17:46] !log Stop replication in sync on db1090:3312 and db2035 - T200061
[10:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:07] (PS1) Marostegui: db-eqiad.php: Repool db1103:3312, depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447051 (https://phabricator.wikimedia.org/T200061)
[10:25:32] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1103:3312, depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447051 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:28:53] (PS15) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[10:29:11] (Merged) jenkins-bot: db-eqiad.php: Repool db1103:3312, depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447051 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:29:20] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[10:29:27] (CR) jenkins-bot: db-eqiad.php: Repool db1103:3312, depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447051 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:30:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312, depool db1105:3312 (duration: 00m 54s)
[10:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:35] !log Stop replication in sync on db1090:3312 and db1105:3312 - T200061
[10:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:40] T200061: Recent duplicate entries on change_tag on sanitarium hosts - https://phabricator.wikimedia.org/T200061
[10:33:12] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[10:34:02] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.025 second response time
[10:35:10] (PS1) Marostegui: db-eqiad.php: Repool db1090:3312, db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447053 (https://phabricator.wikimedia.org/T200061)
[10:39:17] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1090:3312, db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447053 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:40:34] (Merged) jenkins-bot: db-eqiad.php: Repool db1090:3312, db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447053 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:40:46] (CR) jenkins-bot: db-eqiad.php: Repool db1090:3312, db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/447053 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[10:41:18] (PS16) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[10:41:46] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[10:42:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3312, db1105:3312 (duration: 00m 53s)
[10:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:42] (PS1) Mobrovac: JobQueue: Signal JobQueueEventBus is never read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594)
[10:45:53] (PS17) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[10:46:04] (PS1) Arturo Borrero Gonzalez: cloudvps: add cloudcontrol1003.wikimedia.org FQDN [dns] - https://gerrit.wikimedia.org/r/447056 (https://phabricator.wikimedia.org/T200068)
[10:46:15] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[10:46:34] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: add cloudcontrol1003.wikimedia.org FQDN [dns] - https://gerrit.wikimedia.org/r/447056 (https://phabricator.wikimedia.org/T200068) (owner: Arturo Borrero Gonzalez)
[10:53:21] (PS1) Arturo Borrero Gonzalez: cloudvps: rename labnet1003.wikimedia.org to cloudnet1003.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/447057 (https://phabricator.wikimedia.org/T200068)
[10:58:37] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: rename labnet1003.wikimedia.org to cloudnet1003.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/447057 (https://phabricator.wikimedia.org/T200068) (owner: Arturo Borrero Gonzalez)
[10:59:23] jynus: I'm merging a similar patch right now. The FQDN is now created in the DNS server and there should be no issues this time
[10:59:32] cool
[11:11:19] (PS18) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[11:11:49] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[11:16:17] (PS19) Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871
[11:16:42] (CR) jerkins-bot: [V: -1] [WIP] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (owner: Jcrespo)
[11:22:37] (PS3) Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717)
[11:23:12] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:23:19] (CR) jerkins-bot: [V: -1] WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[11:23:33] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:23:33] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:23:41] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:23:52] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:24:01] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:01] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:24:02] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[11:24:20] hello stat1005!
[11:24:29] I guess that you are burning a bit atm
[11:25:54] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=8&fullscreen&var-server=stat1005&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now-1m
[11:27:11] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[11:29:38] RECOVERY - DPKG on stat1005 is OK: All packages OK
[11:29:47] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[11:30:07] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[11:30:17] RECOVERY - Disk space on stat1005 is OK: DISK OK
[11:30:18] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[11:31:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures
[11:31:16] killed the scripts running, informed the user
[11:32:49] Operations, Traffic, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (ema) A note about ATS: support for brotli [[https://github.com/apache/trafficserver/pull/1557|has been added]] in version 7.1.0. However, libbrotli-dev is not available in jessie. Gi...
[11:48:49] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[11:58:08] !log mobrovac@deploy1001 Started deploy [eventstreams/deploy@01fac88]: Lower librdkafka settings related to fetching messages - T199813
[11:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:13] T199813: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813
[12:00:20] !log mobrovac@deploy1001 Finished deploy [eventstreams/deploy@01fac88]: Lower librdkafka settings related to fetching messages - T199813 (duration: 02m 11s)
[12:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:35] (PS20) Jcrespo: Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224)
[12:27:02] (CR) jerkins-bot: [V: -1] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224) (owner: Jcrespo)
[12:28:39] (PS1) Marostegui: db-eqiad.php: Depool db1096:3316,db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/447061 (https://phabricator.wikimedia.org/T200061)
[12:30:09] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1096:3316,db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/447061 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:31:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1096:3316,db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/447061 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:31:33] (CR) jenkins-bot: db-eqiad.php: Depool db1096:3316,db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/447061 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:31:52] !log marostegui@deploy1001 sync-file aborted: Depool db1085 and db1096:33156 (duration: 00m 01s)
[12:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:10] (PS21) Jcrespo: Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224)
[12:32:18] !log Stop replication in sync db1085 and db1096:3316
[12:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:32] (CR) jerkins-bot: [V: -1] Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224) (owner: Jcrespo)
[12:32:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 and db1096:3316 (duration: 00m 56s)
[12:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:46] (PS22) Jcrespo: Add replication managing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224)
[12:37:50] (PS1) Marostegui: db-eqiad.php: Repool db1096:3316, depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/447062 (https://phabricator.wikimedia.org/T200061)
[12:45:01] (PS1) Arturo Borrero Gonzalez: cloudvps: cleanup labcontrol1003 resources [dns] - https://gerrit.wikimedia.org/r/447063 (https://phabricator.wikimedia.org/T200068)
[12:45:28] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: cleanup labcontrol1003 resources [dns] - https://gerrit.wikimedia.org/r/447063 (https://phabricator.wikimedia.org/T200068) (owner: Arturo Borrero Gonzalez)
[12:49:15] Operations, ops-eqiad, Cloud-Services, Epic: Relabel labcontrol1003.wikimedia.org as cloudcontrol1003.wikimedia.org - https://phabricator.wikimedia.org/T200080 (aborrero) p:Triage>Normal
[12:50:57] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1096:3316, depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/447062 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:52:05] (Merged) jenkins-bot: db-eqiad.php: Repool db1096:3316, depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/447062 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:52:21] (CR) jenkins-bot: db-eqiad.php: Repool db1096:3316, depool db1098:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/447062 (https://phabricator.wikimedia.org/T200061) (owner: Marostegui)
[12:53:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3316, depool db1098:3316 (duration: 00m 54s)
[12:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:58] !log Stop replication in sync db1085 and db1096:3318
[12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:05] (CR) Reedy: "I just lazily copied the file between the two locations. Will partially revert" [puppet] - https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) (owner: Reedy)
[12:59:31] (PS2) Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935)
[13:02:40] (PS3) Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935)
[13:02:45] (PS1) Legoktm: throttle: Update Wikimania IP address [mediawiki-config] - https://gerrit.wikimedia.org/r/447065 (https://phabricator.wikimedia.org/T198288)
[13:03:03] jouncebot: now
[13:03:04] No deployments scheduled for the next 69 hour(s) and 56 minute(s)
[13:03:06] :D
[13:03:07] k
[13:03:11] do it!
[13:03:29] (03CR) 10Addshore: [C: 031] throttle: Update Wikimania IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447065 (https://phabricator.wikimedia.org/T198288) (owner: 10Legoktm) [13:03:38] srsly [13:03:40] (03CR) 10Legoktm: [C: 032] throttle: Update Wikimania IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447065 (https://phabricator.wikimedia.org/T198288) (owner: 10Legoktm) [13:03:45] Reedy: shhh [13:04:32] Reedy: I literally just sat down to eat lunch too [13:04:48] (03Merged) 10jenkins-bot: throttle: Update Wikimania IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447065 (https://phabricator.wikimedia.org/T198288) (owner: 10Legoktm) [13:05:04] legoktm: i would do it but im so tired id end up breaking everything [13:05:08] * addshore is with siebrand [13:06:30] !log legoktm@deploy1001 Synchronized wmf-config/throttle.php: Update Wikimania IP address - T198288 (duration: 00m 54s) [13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:34] T198288: Increase account creation at Wikimania 2018 July 18-22 - https://phabricator.wikimedia.org/T198288 [13:08:22] (03CR) 10jenkins-bot: throttle: Update Wikimania IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447065 (https://phabricator.wikimedia.org/T198288) (owner: 10Legoktm) [13:10:35] (03PS1) 10Marostegui: db-eqiad.php: Repool db1098:3316, depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447069 (https://phabricator.wikimedia.org/T200061) [13:11:46] legoktm: There should be a script to clear the throttling key too.. [13:12:46] oh, do we need to do that? 
[13:13:01] Yeah, otherwise the key stays in memcached and nothing more happens :P [13:13:04] lol [13:13:07] https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold is out of date [13:13:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3316, depool db1113:3316 (duration: 00m 54s) [13:13:42] !log Stop replication in sync db1085 and db1113:3316 [13:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:56] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: Repool db1098:3316, depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447069 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [13:13:58] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1098:3316, depool db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447069 (https://phabricator.wikimedia.org/T200061) (owner: 10Marostegui) [13:14:45] Oh [13:14:45] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/435974/ [13:14:47] It wasn't merged :( [13:15:43] uh, what is terbium now? [13:15:59] mwmaint1001 [13:16:03] just do it from deploy1001 [13:16:56] (03PS1) 10Ema: Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T199720) [13:18:38] legoktm@deploy1001:~$ mwscript ../../../../../../../home/legoktm/resetAuthenticationThrottle.php --wiki=enwiki --signup --ip "197.101.76.150" [13:18:38] Clearing signup throttle... done [13:19:45] Reedy: oh, you already merged it too. 
Awesome :D [13:19:53] Yeah [13:19:57] I thought it had been done already :( [13:22:37] (03PS1) 10Marostegui: db-eqiad.php: Repool db1085, db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447076 [13:22:53] (03PS2) 10Ema: Initial WMF packaging [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/447074 (https://phabricator.wikimedia.org/T199720) [13:30:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1085, db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447076 (owner: 10Marostegui) [13:30:56] (03PS3) 10Urbanecm: Change zhwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [13:31:58] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1085, db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447076 (owner: 10Marostegui) [13:32:15] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1085, db1113:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447076 (owner: 10Marostegui) [13:33:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1085, db1113:3316 (duration: 00m 55s) [13:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:03] (03CR) 10Urbanecm: [C: 031] "Looks good to me from the technical side. I did not check if correct logos are added." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [13:42:43] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:43:53] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:44:32] * elukey takes over mwdebug1002 to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446786/ [13:53:08] (03CR) 10Jcrespo: [C: 032] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (https://phabricator.wikimedia.org/T199224) (owner: 10Jcrespo) [14:03:40] (03PS1) 10Reedy: Rollback wikidatawiki to .12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447078 (https://phabricator.wikimedia.org/T199983) [14:05:09] (03PS2) 10ArielGlenn: generate and email a few dump-related stats each month [puppet] - 10https://gerrit.wikimedia.org/r/303707 (https://phabricator.wikimedia.org/T142435) [14:05:52] (03CR) 10jerkins-bot: [V: 04-1] generate and email a few dump-related stats each month [puppet] - 10https://gerrit.wikimedia.org/r/303707 (https://phabricator.wikimedia.org/T142435) (owner: 10ArielGlenn) [14:06:17] Reedy: can put the rollback on mwdebug? [14:06:23] I suspect this might be a parser cache issue [14:06:26] (03CR) 10Reedy: [C: 032] Rollback wikidatawiki to .12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447078 (https://phabricator.wikimedia.org/T199983) (owner: 10Reedy) [14:06:27] Can do, yeah [14:07:29] (03Merged) 10jenkins-bot: Rollback wikidatawiki to .12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447078 (https://phabricator.wikimedia.org/T199983) (owner: 10Reedy) [14:07:47] (03PS3) 10ArielGlenn: generate and email a few dump-related stats each month [puppet] - 10https://gerrit.wikimedia.org/r/303707 (https://phabricator.wikimedia.org/T142435) [14:08:54] (03CR) 10jenkins-bot: Rollback wikidatawiki to .12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447078 (https://phabricator.wikimedia.org/T199983) (owner: 10Reedy) [14:08:58] Reedy legoktm I support the parser cache suspicion (but naive attempts locally didn't prove it though) [14:09:26] How do we make it rebuild wikiversions only on one host... 
[14:10:11] * Reedy cheats [14:10:50] legoktm: leszek_wmde it's on mwdebug1001 back on .12 [14:10:59] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10faidon) Is there any progress and/or timeline for this? Thanks! [14:11:04] !log T199983 wikidata wiki abck to .12 on mwdebug1001 [14:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] T199983: Wikidata showing wrong language for page elements - https://phabricator.wikimedia.org/T199983 [14:11:34] https://www.wikidata.org/wiki/Property:P5476 is broken on .13 but fixed on .12 [14:11:48] legoktm: leszek_wmde so is Property:P5451 [14:11:51] Didn't we have some cache corruption stuff with some of Aarons changes recently? [14:11:59] the mcrouter things? [14:12:08] yeah [14:12:28] Ok, so we've narrowed it [14:12:33] Pablo_WMDE legoktm confirming [14:12:39] If it's tested working on .12, shall I deploy that everywhere for ow? [14:13:19] Reedy: yes please [14:13:28] $wgWBSharedCacheKey = 'wikibase_shared/' . str_replace( '.', '_', $wmgVersionNumber ) . '-' . $wmgWikibaseCachePrefix; [14:13:39] I'm guessing that is bypassing the bad cache? [14:13:42] or something like that [14:14:58] legoktm: potentially, although it is not entirely clear to me why this cache would result in the kind of issue we saw [14:15:16] I think I'm just randomly guessing now [14:15:24] sure :) [14:15:40] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki back to .12 T199983 [14:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:48] that is cache for entity data, while things like "1 reference" were also off [14:18:39] (03PS4) 10Reedy: Add wikimania.wikimedia.org to apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/446786 (https://phabricator.wikimedia.org/T199935) [14:22:04] Reedy legoktm thanks for help! 
I've pinged Aaron, WMDE from our side will continue looking into this on Monday [14:22:16] Why are the 'pedias on .12 too? [14:22:38] oh [14:23:01] https://phabricator.wikimedia.org/T191059#4438165 [14:25:37] I'm going to go watch some talks now, I'll have IRC on my phone though [14:25:38] o/ [15:16:13] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [15:16:14] ACKNOWLEDGEMENT - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T200092 [15:16:29] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T200092 (10ops-monitoring-bot) [15:16:33] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:16:43] PROBLEM - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error [15:16:53] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 117.48, 167.69, 111.64 [15:17:04] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:17:34] so this seems to be one of the issues that happened in the past weeks, that get fixed with a reboot [15:18:16] oof, upon logging in to ms-be1016: -bash: /usr/share/bash-completion/bash_completion: Input/output error [15:18:37] yeah in console I can see [1315651.476089] sd 0:1:0:12: rejecting I/O to offline device [15:18:50] https://www.irccloud.com/pastebin/Xw0RJoRj/ [15:19:02] alright, shall we out of band reboot? 
[15:19:03] the last time it was the same IIRc on ms-be1036, and it was the controller getting in a wierd state [15:19:16] herron: yeah going to do it [15:19:19] kk [15:19:27] I’m around if you need any help [15:19:41] !log powercycle ms-be1016 - RAID errors, no ssh available, I/O errors in com2 console [15:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:03] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:43] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [15:23:44] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 14.13, 3.08, 1.01 [15:23:52] thanks elukey ! [15:23:53] RECOVERY - Host ms-be1016 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [15:23:54] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [15:24:13] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:24:17] godog: prego! It seems like the raid controller went awol [15:24:33] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:26:56] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T200092 (10elukey) 05Open>03Resolved a:03elukey After a powercycle the error went away, probably a temporary issue in the RAID controller? 
[15:33:05] !log rolling restart of eventstreams on scb2* nodes to reduce the memory pressure before the weekend (still waiting for a permanent fix) [15:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:29] mobrovac: --^ [15:33:53] ah ok elukey, i though i would have done it later :) [15:34:42] done :) [15:45:55] 10Operations, 10monitoring, 10Privacy, 10Security: status.wikimedia.org should have an alternative privacy policy - https://phabricator.wikimedia.org/T189763 (10Reedy) [15:45:57] 10Operations, 10monitoring, 10Patch-For-Review: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10Reedy) [15:46:34] 10Operations, 10monitoring: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530 (10Reedy) [15:46:36] 10Operations: status.wikimedia.org should use some Wikimedia favicon if possible - https://phabricator.wikimedia.org/T134458 (10Reedy) [15:46:40] 10Operations, 10monitoring, 10Patch-For-Review: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10Reedy) [15:47:35] 10Operations, 10monitoring, 10Patch-For-Review: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10Reedy) [15:57:06] 10Operations, 10monitoring: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530 (10Imarlier) @TheDJ Guessing this was obvious from the parent task that Reedy added, but the short story is that the existing status.wikimedia.org page is effectively unmain... [16:01:26] 10Operations, 10monitoring: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530 (10Nemo_bis) status.wikimedia.org has indeed no usefulness whatsoever: #wikimedia-tech on Freenode is our only real venue to report system status. The larger issue has been... 
[16:19:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:24:53] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:39:32] (03CR) 10Aaron Schulz: JobQueue: Signal JobQueueEventBus is never read-only (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447055 (https://phabricator.wikimedia.org/T199594) (owner: 10Mobrovac) [17:08:14] !log enable ospf on eqdfw-knams link (GTT) [17:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:26] !log enable bgp on eqdfw-knams link (GTT) [17:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.04 seconds [17:37:45] (03CR) 10MarcoAurelio: [C: 04-1] "Logo naming is not correct IMHO. Should be zh_hanswiki (cfr. roa_rupwiki, zh_min_nanwiki, zh_yuewiki, etc. on /images/project-logos).The e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [17:39:24] (03CR) 10MarcoAurelio: "Apologies, I think I misread something. Disregard." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/447030 (https://phabricator.wikimedia.org/T199863) (owner: 10星耀晨曦) [17:42:24] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0 [17:44:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 49 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [17:46:04] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 30 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:49:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [17:51:13] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:53:29] (03PS2) 10Herron: Remove LDAP access for siddarth11 [puppet] - 10https://gerrit.wikimedia.org/r/446225 (owner: 10Muehlenhoff) [17:54:39] (03CR) 10Herron: [C: 032] Remove LDAP access for siddarth11 [puppet] - 10https://gerrit.wikimedia.org/r/446225 (owner: 10Muehlenhoff) [18:00:13] PROBLEM - Disk space on elastic1022 is CRITICAL: DISK CRITICAL - free space: /srv 52443 MB (10% inode=99%) [18:02:24] RECOVERY - Disk space on elastic1022 is OK: DISK OK [18:02:54] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [18:13:37] (03PS1) 10Bstorm: gridengine: Add package information for stretch exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/447089 (https://phabricator.wikimedia.org/T199276) [18:14:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:17:27] 10Operations, 10ops-eqiad, 10DBA, 
10decommission: Decommission db1055 - https://phabricator.wikimedia.org/T194118 (10RobH) [18:18:29] (03PS1) 10RobH: decom db1055 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/447090 (https://phabricator.wikimedia.org/T194118) [18:19:23] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:20:00] (03CR) 10RobH: [C: 032] decom db1055 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/447090 (https://phabricator.wikimedia.org/T194118) (owner: 10RobH) [18:20:37] (03PS1) 10RobH: decom of db1055 [puppet] - 10https://gerrit.wikimedia.org/r/447091 (https://phabricator.wikimedia.org/T194118) [18:21:13] (03CR) 10RobH: [C: 032] decom of db1055 [puppet] - 10https://gerrit.wikimedia.org/r/447091 (https://phabricator.wikimedia.org/T194118) (owner: 10RobH) [18:22:43] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1055 - https://phabricator.wikimedia.org/T194118 (10RobH) a:05RobH>03Cmjohnson [18:24:42] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736 (10RobH) [18:25:28] (03PS7) 10Bstorm: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [18:25:53] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) (owner: 10Bstorm) [18:27:50] (03PS8) 10Bstorm: wiki replicas - prepare for refactored actor storage [puppet] - 10https://gerrit.wikimedia.org/r/431823 (https://phabricator.wikimedia.org/T195747) [18:33:55] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736 (10RobH) [18:35:11] 10Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 
(10Capt_Swing) [18:36:26] (03PS1) 10RobH: decom of db1056 [puppet] - 10https://gerrit.wikimedia.org/r/447094 (https://phabricator.wikimedia.org/T193736) [18:37:11] (03PS1) 10RobH: decom db1056 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/447095 (https://phabricator.wikimedia.org/T193736) [18:37:22] (03CR) 10RobH: [C: 032] decom of db1056 [puppet] - 10https://gerrit.wikimedia.org/r/447094 (https://phabricator.wikimedia.org/T193736) (owner: 10RobH) [18:37:44] (03CR) 10RobH: [C: 032] decom db1056 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/447095 (https://phabricator.wikimedia.org/T193736) (owner: 10RobH) [18:39:25] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736 (10RobH) a:03Cmjohnson [18:49:03] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:49:25] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1060 - https://phabricator.wikimedia.org/T193732 (10RobH) [18:51:07] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1060 - https://phabricator.wikimedia.org/T193732 (10RobH) [18:52:08] (03PS1) 10RobH: decom db1060 [puppet] - 10https://gerrit.wikimedia.org/r/447097 (https://phabricator.wikimedia.org/T193732) [18:53:13] (03CR) 10RobH: [C: 032] decom db1060 [puppet] - 10https://gerrit.wikimedia.org/r/447097 (https://phabricator.wikimedia.org/T193732) (owner: 10RobH) [18:54:18] (03PS1) 10RobH: decom of db1060 prod dns [dns] - 10https://gerrit.wikimedia.org/r/447098 (https://phabricator.wikimedia.org/T193732) [18:55:14] (03CR) 10RobH: [C: 032] decom of db1060 prod dns [dns] - 10https://gerrit.wikimedia.org/r/447098 (https://phabricator.wikimedia.org/T193732) (owner: 10RobH) [18:55:31] (03PS2) 10RobH: decom elastic1021 prod dns [dns] - 10https://gerrit.wikimedia.org/r/419813 (https://phabricator.wikimedia.org/T189727) [18:55:41] (03Abandoned) 10RobH: decom 
elastic1021 prod dns [dns] - 10https://gerrit.wikimedia.org/r/419813 (https://phabricator.wikimedia.org/T189727) (owner: 10RobH) [18:55:56] (03Abandoned) 10RobH: return tempdb2001 to spares [dns] - 10https://gerrit.wikimedia.org/r/352045 (owner: 10RobH) [18:56:59] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1060 - https://phabricator.wikimedia.org/T193732 (10RobH) a:05RobH>03Cmjohnson [19:00:03] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 1.111 second response time [19:09:51] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515 (10matej_suchanek) [19:35:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 282.04 seconds [20:58:44] 10Operations: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10Reedy) You can make your own gerrit changeset for this ;) [21:38:38] 10Operations, 10Performance-Team, 10vm-requests: Increase webperf1002/webperf2002 space from 50GB to 500 GB (Ganeti) - https://phabricator.wikimedia.org/T199853 (10Krinkle) @herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is that doable? I'll slice the Swift support for ArcLamp... 
[21:49:32] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [21:50:36] !log deployed patch T200104 [21:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:06] (03PS1) 10Andrew Bogott: labservices: help pdns talk to the local database [puppet] - 10https://gerrit.wikimedia.org/r/447105 [23:28:33] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Liuxinyu970226) @Johan I strongly recommend to also announce this task in the [[https://meta.wikimedia.org/...