[00:38:09] PROBLEM - HP RAID on db2061 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Failed: 1I:1:12 - Controller: OK - Battery/Capacitor: OK [00:38:11] ACKNOWLEDGEMENT - HP RAID on db2061 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Failed: 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T199759 [00:38:24] 10Operations, 10ops-codfw: Degraded RAID on db2061 - https://phabricator.wikimedia.org/T199759 (10ops-monitoring-bot) [01:19:30] herron: Could use review on these patches, ideally this week? https://gerrit.wikimedia.org/r/#/q/hashtag:beta-picked+is:open+topic:webperf+owner:Krinkle [01:20:05] I've gone fairly far within what I can test on beta, but to make progress further the stack is getting a bit too big. Would prefer this to land first. [01:21:30] PROBLEM - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,11 instance=db2061:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops [01:27:45] (03CR) 10Krinkle: "Clean compiler output (some unrelated changes from other commits in this stack, given the compiler compares to HEAD) – https://puppet-comp" [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [01:36:46] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [01:48:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [01:53:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [01:54:12] 10Operations, 10Core-Platform-Team, 10monitoring, 10Wikimedia-Incident: Add alerts for Logstash rates in production - https://phabricator.wikimedia.org/T199479 (10Fjalapeno) Adding Operations, but not sure where this task should actually go [01:54:40] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [02:04:09] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 50522 MB (10% inode=99%) [02:11:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:12:59] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [02:33:29] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:33:59] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:34:39] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [02:34:40] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 13m 51s) [02:34:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:50] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:37:20] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [02:38:00] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:45:01] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 17 02:45:01 UTC 2018 (duration 10m 21s) [02:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:49] RECOVERY - Disk space on elastic1024 is OK: DISK OK [03:07:30] (03CR) 10Krinkle: "Fails compiler currently because xhgui and arclamp both use class 'httpd'." [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [03:08:40] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [03:09:59] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 52279 MB (10% inode=99%) [03:17:50] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 51984 MB (10% inode=99%) [03:26:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 943.06 seconds [03:36:30] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [03:37:09] PROBLEM - Disk space on elastic1025 
is CRITICAL: DISK CRITICAL - free space: /srv 52690 MB (10% inode=99%) [03:37:40] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [03:54:19] RECOVERY - Disk space on elastic1025 is OK: DISK OK [03:58:40] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:09:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 215.08 seconds [04:35:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:39:18] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2061 - https://phabricator.wikimedia.org/T199759 (10Marostegui) p:05Triage>03Normal a:03Papaul Can we get a replacement? Thanks! [04:40:00] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,11 instance=db2061:9100 job=node site=codfw Marostegui T199759 - The acknowledgement expires at: 2018-07-27 04:39:29. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops [04:43:30] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:44:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [04:46:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446214 (https://phabricator.wikimedia.org/T199368) [04:47:40] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:47:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446214 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [04:48:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [04:48:59] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [04:49:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446214 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [04:49:20] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [04:49:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446214 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [04:50:30] PROBLEM - recommendation_api endpoints health 
on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:50:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 for alter table (duration: 00m 53s) [04:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:21] Deploy schema change on db1096:3315 T144010 T51190 T199368 [04:51:21] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [04:51:22] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [04:51:22] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [04:51:30] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:51:40] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:51:40] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [04:52:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:52:40] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [04:52:59] 
RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [04:53:50] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [04:55:09] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [04:55:18] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446215 [04:56:10] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [04:57:00] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446215 (owner: 10Marostegui) [04:58:39] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [04:58:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446215 (owner: 10Marostegui) [04:59:08] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446215 (owner: 10Marostegui) [04:59:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3315 after alter table (duration: 00m 49s) [04:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446217 (https://phabricator.wikimedia.org/T199368) [05:00:40] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:00:49] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: 
/{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:01:50] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:01:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446217 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:01:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [05:03:09] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [05:03:09] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [05:03:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [05:03:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446217 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:03:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446217 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:04:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446219 [05:04:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 for alter table (duration: 00m 50s) [05:04:41] Deploy schema change on db1097:3315 T144010 T51190 T199368 [05:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:04:43] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:04:44] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:04:44] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:05:29] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:06:30] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [05:06:39] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [05:08:09] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [05:08:59] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:08:59] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446219 (owner: 10Marostegui) [05:09:30] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:09:49] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:09:59] PROBLEM - recommendation_api 
endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:10:09] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [05:10:10] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [05:10:44] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446219 (owner: 10Marostegui) [05:10:49] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [05:10:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446219 (owner: 10Marostegui) [05:11:10] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [05:11:20] Deploy schema change on db1113:3315 T144010 T51190 T199368 [05:11:21] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:11:21] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:11:21] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:12:00] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [05:12:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 after alter table (duration: 00m 49s) [05:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:29] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [05:13:29] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [05:14:19] (03PS1) 10Marostegui: db-eqiad.php: 
Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446220 (https://phabricator.wikimedia.org/T199368) [05:14:30] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [05:14:49] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [05:15:49] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:15:49] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:15:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446220 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:16:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): /{domain}/v1/translation/articles/{source}{/seed} (bad seed) timed out before a response was received [05:17:34] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446220 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:17:40] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 
(expecting: 200) [05:18:00] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:18:09] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [05:18:20] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [05:18:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 for alter table (duration: 00m 49s) [05:18:43] Deploy schema change on db1082 with replication, this will generate lag on labs:s5 T144010 T51190 T199368 [05:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:44] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:18:45] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:18:45] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:18:50] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [05:19:10] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [05:19:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [05:19:10] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [05:19:19] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [05:19:19] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy 
[05:19:19] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [05:19:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446221 [05:19:39] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [05:20:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446220 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:22:09] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446221 (owner: 10Marostegui) [05:23:49] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446221 (owner: 10Marostegui) [05:25:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 after alter table (duration: 00m 49s) [05:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:17] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446221 (owner: 10Marostegui) [05:26:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446222 (https://phabricator.wikimedia.org/T199368) [05:28:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446222 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:29:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446222 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:30:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446222 (https://phabricator.wikimedia.org/T199368) (owner: 10Marostegui) [05:30:59] !log marostegui@deploy1001 
Synchronized wmf-config/db-eqiad.php: Depool db1100 for alter table (duration: 00m 49s) [05:31:00] Deploy schema change on db1100 T144010 T51190 T199368 [05:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:01] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:31:02] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:31:04] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:32:24] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446223 [05:33:23] Deploy schema change on db1070 (s5 primary master) T144010 T51190 T199368 [05:34:04] !log Deploy schema change on db1070 (s5 primary master) T144010 T51190 T199368 [05:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446223 (owner: 10Marostegui) [05:35:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446223 (owner: 10Marostegui) [05:37:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 after alter table (duration: 00m 49s) [05:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:19] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [05:38:37] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446223 (owner: 10Marostegui) [05:43:42] (03PS6) 10Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - 
10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) [05:44:04] (03CR) 10Krinkle: "Seems puppet compiler missed a major error that I found on beta:" [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) (owner: 10Krinkle) [05:55:57] !log Deploy schema change on db2039 (s6 codfw master) with replication, this will generate lag on codfw T144010 T51190 T199368 [05:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:03] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [05:56:03] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [05:56:03] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [05:56:50] (03PS3) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [06:14:29] (03PS7) 10Elukey: turnilo: addi a measure of the bot requests percentage on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [06:14:35] (03PS8) 10Elukey: turnilo: addi a measure of the bot requests percentage on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [06:15:13] (03PS9) 10Elukey: turnilo: add a measure of the bot requests percentage on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [06:20:55] (03PS1) 10Muehlenhoff: Remove LDAP access for siddarth11 [puppet] - 10https://gerrit.wikimedia.org/r/446225 [06:30:00] !log installing postgresql-9.6 updates from stretch point release [06:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:33] !log Drop unused grants on es codfw hosts [06:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:07] !log installing subversion security updates [06:39:09] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:25] !log installing patch security updates [06:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:56] !log updated stretch netinst image for 9.5 [07:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:01] ^volans [07:10:30] moritzm: ack, thanks [07:11:41] 10Operations, 10ops-esams, 10Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (10MoritzMuehlenhoff) That sounds like a hang in the NIC, but I doubt we have any useful hardware diagnostics/logging on that level. [07:15:19] RECOVERY - swift-container-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:15:20] RECOVERY - swift-account-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:15:29] RECOVERY - swift-container-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:15:30] RECOVERY - swift-container-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [07:15:30] RECOVERY - swift-account-auditor on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:15:30] RECOVERY - swift-account-reaper on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:15:30] RECOVERY - swift-account-server on ms-be1041 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:15:49] RECOVERY - swift-object-server on ms-be1041 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:15:49] RECOVERY - swift-container-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-container-replicator [07:15:49] RECOVERY - swift-object-updater on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [07:15:49] RECOVERY - swift-object-replicator on ms-be1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [07:16:00] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational [07:16:10] RECOVERY - swift-object-auditor on ms-be1041 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [07:17:15] !log Drop unused grants on es eqiad hosts [07:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:40] (03CR) 10Volans: [V: 032 C: 032] "@akosiaris: I don't know if you were planning to review this, but as this has been reviewed by Valentin and Moritz and you're out for a bi" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443361 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [07:18:04] (03CR) 10Volans: [V: 032 C: 032] Allow to delete hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443362 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [07:19:13] (03CR) 10Volans: [V: 032 C: 032] Allow to link hosts to external resources [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443363 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans) [07:19:58] (03CR) 10Volans: [V: 032 C: 032] DataTables: save state for the session [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443364 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [07:20:32] (03CR) 10Volans: [V: 032 C: 032] DataTables: cleanup initialization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443365 (https://phabricator.wikimedia.org/T198591) (owner: 10Volans) [07:24:10] !log installing xapian-core security updates [07:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:25:28] (03CR) 10Joal: [C: 031] "LGTM !" [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:28:24] (03PS1) 10Elukey: Add IPv6 PTR records for Analytics hosts [dns] - 10https://gerrit.wikimedia.org/r/446235 (https://phabricator.wikimedia.org/T199180) [07:28:30] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [07:28:34] (03CR) 10Elukey: [C: 032] turnilo: add a measure of the bot requests percentage on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 (owner: 10Nuria) [07:28:37] (03CR) 10jerkins-bot: [V: 04-1] Add IPv6 PTR records for Analytics hosts [dns] - 10https://gerrit.wikimedia.org/r/446235 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [07:28:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [07:29:31] !log un xfs_repair on filesystems reporting negative space available on ms-be1042 - T199198 [07:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:35] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:34:47] (03PS1) 10Volans: Updated src to v0.1.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/446236 [07:34:49] (03PS1) 10Volans: Built wheels for v0.1.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/446237 [07:35:40] (03PS1) 10Elukey: turnilo: fix indentation of config file [puppet] - 10https://gerrit.wikimedia.org/r/446238 [07:36:02] (03CR) 10Muehlenhoff: [C: 031] Updated src to v0.1.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/446236 (owner: 10Volans) [07:36:55] (03CR) 10Elukey: [C: 032] turnilo: fix indentation of config file [puppet] - 10https://gerrit.wikimedia.org/r/446238 (owner: 10Elukey) [07:39:14] (03PS2) 10Elukey: Add IPv6 PTR records for Analytics 
hosts [dns] - 10https://gerrit.wikimedia.org/r/446235 (https://phabricator.wikimedia.org/T199180) [07:42:19] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:42:39] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [07:42:53] (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/446236 (owner: 10Volans) [07:43:01] (03CR) 10Volans: [V: 032 C: 032] Built wheels for v0.1.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/446237 (owner: 10Volans) [07:44:48] 10Operations, 10Analytics, 10procurement, 10User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (10elukey) a:05elukey>03None [07:45:01] 10Operations, 10Analytics, 10procurement, 10User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (10elukey) a:05elukey>03None [07:50:14] !log volans@deploy1001 Started deploy [debmonitor/deploy@691d2f8]: Release v0.1.6 [07:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:55] !log volans@deploy1001 Finished deploy [debmonitor/deploy@691d2f8]: Release v0.1.6 (duration: 00m 41s) [07:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:49] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:52:49] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. 
[07:53:40] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [07:54:10] (03PS6) 10Volans: debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) [07:54:36] (03CR) 10jerkins-bot: [V: 04-1] debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [07:56:07] (03PS2) 10Volans: debmonitor: configure external links [puppet] - 10https://gerrit.wikimedia.org/r/443377 (https://phabricator.wikimedia.org/T198590) [07:57:01] (03CR) 10Volans: [C: 032] debmonitor: configure external links [puppet] - 10https://gerrit.wikimedia.org/r/443377 (https://phabricator.wikimedia.org/T198590) (owner: 10Volans) [07:57:59] (03PS1) 10Muehlenhoff: Move declaration of diamond package out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) [08:04:33] (03PS1) 10Volans: debmonitor: fix typo in config [puppet] - 10https://gerrit.wikimedia.org/r/446243 [08:06:27] (03CR) 10Volans: [C: 032] debmonitor: fix typo in config [puppet] - 10https://gerrit.wikimedia.org/r/446243 (owner: 10Volans) [08:08:30] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[08:11:13] (03PS1) 10Volans: debmonitor: fine-tune config for link to Netbox [puppet] - 10https://gerrit.wikimedia.org/r/446245 [08:12:07] (03CR) 10Volans: [C: 032] debmonitor: fine-tune config for link to Netbox [puppet] - 10https://gerrit.wikimedia.org/r/446245 (owner: 10Volans) [08:13:14] (03PS7) 10Volans: debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) [08:14:57] (03CR) 10jerkins-bot: [V: 04-1] debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [08:27:09] 10Operations, 10Core-Platform-Team, 10WMF-JobQueue, 10Services (designing), and 2 others: Exception "Job queue is read-only" - https://phabricator.wikimedia.org/T199594 (10mobrovac) Thanks for unearthing this, @Krinkle . This is probably the last thing we forgot to change in the JobQueue switch. So, both t... [08:31:33] Deploy schema change on db1096:3316 T144010 T51190 T199368 [08:31:34] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [08:31:34] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [08:31:34] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [08:31:44] !log Deploy schema change on db1096:3316 T144010 T51190 T199368 [08:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:19] PROBLEM - eventstreams on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8092: Connection refused [08:36:00] PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) 
timed out before a response was received [08:36:00] PROBLEM - apertium apy on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 2737: Connection refused [08:36:29] RECOVERY - eventstreams on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.078 second response time [08:37:09] RECOVERY - cxserver endpoints health on scb2006 is OK: All endpoints are healthy [08:37:10] RECOVERY - apertium apy on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [08:40:00] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 4 minutes ago with 9 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[cpjobqueue/deploy],Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy] [08:44:22] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud vps: labtest: missing allowed connection" [puppet] - 10https://gerrit.wikimedia.org/r/446255 (https://phabricator.wikimedia.org/T196752) [08:44:24] I see on recent deployments [08:44:39] this could be the memory issue we saw in the last weeks [08:45:14] (03PS1) 10Volans: Show external links only if configured [software/debmonitor] - 10https://gerrit.wikimedia.org/r/446256 [08:45:43] indeed memory issues according to kernel [08:46:08] ^ mobrovac [08:46:09] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "cloud vps: labtest: missing allowed connection" [puppet] - 10https://gerrit.wikimedia.org/r/446255 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [08:46:40] (03CR) 10Muehlenhoff: [C: 031] Show external links only if configured [software/debmonitor] - 10https://gerrit.wikimedia.org/r/446256 (owner: 10Volans) [08:46:47] (03CR) 10jerkins-bot: [V: 04-1] Show external links only if configured [software/debmonitor] - 10https://gerrit.wikimedia.org/r/446256 (owner: 10Volans) [08:47:32] (03PS1) 10Jforrester: PageImages: Make it possible to add extra namespaces [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/446257 (https://phabricator.wikimedia.org/T198716) [08:47:35] (03PS1) 10Jforrester: PageImages: Make it possible to add extra namespaces, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446258 [08:47:36] (03PS1) 10Jforrester: PageImages: Add NS_CATEGORY for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446259 [08:47:40] (03CR) 10Volans: [V: 032 C: 032] Show external links only if configured [software/debmonitor] - 10https://gerrit.wikimedia.org/r/446256 (owner: 10Volans) [08:51:14] !log installing libipc-run-perl update from stretch 9.5 point release [08:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:11] I've documented the new failure of the instance at T191199 [08:52:12] T191199: Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 [08:52:20] *instance of the failure [09:05:39] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:11:27] !log Deploy schema change on db1098:3316 T144010 T51190 T199368 [09:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:33] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [09:11:33] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [09:11:33] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [09:16:44] (03PS3) 10Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968) [09:16:49] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Close https://lists.wikimedia.org/mailman/listinfo/cep and keep the archive for now - https://phabricator.wikimedia.org/T155683 (10Elitre) [09:16:59] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Close https://lists.wikimedia.org/mailman/listinfo/cep 
and keep the archive for now - https://phabricator.wikimedia.org/T155683 (10Elitre) Per the above. [09:18:40] 10Operations, 10CommRel-Specialists-Support: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Elitre) [09:18:51] (03PS1) 10Giuseppe Lavagetto: wmflib: fix spec for newer versions of mocha/rspec [puppet] - 10https://gerrit.wikimedia.org/r/446269 [09:21:27] (03CR) 10Volans: [C: 031] "LGTM, thanks for fixing it!" [puppet] - 10https://gerrit.wikimedia.org/r/446269 (owner: 10Giuseppe Lavagetto) [09:22:50] 10Operations, 10CommRel-Specialists-Support, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Elitre) a:03Johan Thanks a lot for filing this, Mark! Johan should be able to assist then. @Whatamidoing-WMF will be his backup. [09:38:08] !log Drop unused grants from db2037, db2042, db2078:3323, db2078:3325 [09:38:10] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10fgiunchedi) [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:50] (03CR) 10Volans: [C: 032] wmflib: fix spec for newer versions of mocha/rspec [puppet] - 10https://gerrit.wikimedia.org/r/446269 (owner: 10Giuseppe Lavagetto) [09:40:29] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10mobrovac) >>! In T191199#4429788, @jcrespo wrote: > I believe this, or something similar related to memory-related stalls happened on scb2006. Indeed, this seems to be the ca... 
[09:41:07] !log Deploy schema change on db1113:3316 T144010 T51190 T199368 [09:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [09:41:16] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [09:41:16] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [09:45:15] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: labtestn: allow more connections from labtest [puppet] - 10https://gerrit.wikimedia.org/r/446274 (https://phabricator.wikimedia.org/T196752) [09:45:59] (03CR) 10jerkins-bot: [V: 04-1] cloud vps: labtestn: allow more connections from labtest [puppet] - 10https://gerrit.wikimedia.org/r/446274 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [09:47:30] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) >>! In T197242#4428119, @mobrovac wrote: > We first need to make Citoid and v2 of the translation server work together locally, then i... 
[09:48:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11803/ compiler happy" [puppet] - 10https://gerrit.wikimedia.org/r/446274 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [09:48:49] (03PS8) 10Volans: debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) [09:48:51] (03PS1) 10Volans: debmonitor: set proxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/446275 [09:48:59] (03PS2) 10Arturo Borrero Gonzalez: cloud vps: labtestn: allow more connections from labtest [puppet] - 10https://gerrit.wikimedia.org/r/446274 (https://phabricator.wikimedia.org/T196752) [09:49:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: labtestn: allow more connections from labtest [puppet] - 10https://gerrit.wikimedia.org/r/446274 (https://phabricator.wikimedia.org/T196752) (owner: 10Arturo Borrero Gonzalez) [09:50:59] 10Operations, 10Core-Platform-Team, 10WMF-JobQueue, 10Services (designing), and 2 others: Exception "Job queue is read-only" - https://phabricator.wikimedia.org/T199594 (10mobrovac) >>! In T199594#4429792, @Joe wrote: > I guess that we could in fact decouple enqueueing and executing jobs, but I'm not sure... [09:52:29] (03CR) 10Volans: "Compiler results available here: https://puppet-compiler.wmflabs.org/compiler03/11805/debmonitor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/446275 (owner: 10Volans) [09:55:54] 10Operations, 10ChangeProp, 10Services (designing), 10Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (10mobrovac) Agreed. Using MirrorMaker we should be able to achieve this smoothly. @Ottomata and @elukey, let's put a timeline together? 
[10:06:47] jouncebot: next [10:06:47] In 0 hour(s) and 53 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1100) [10:07:36] !log restarting zookeeper in codfw to pick up Java security updates [10:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:33] !log reimage mw2224 to test the new stretch netboot and wmf-auto-reimage changes [10:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:20] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:13:30] PROBLEM - pdfrender on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:13:39] PROBLEM - mathoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:13:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [10:13:49] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out [10:13:49] was received: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received [10:13:50] PROBLEM - eventstreams on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:13:51] mobrovac: again? 
^^^ [10:13:59] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:14:09] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:14:16] !log Deploy schema change on db1093 T144010 T51190 T199368 [10:14:19] PROBLEM - apertium apy on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:22] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 [10:14:22] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 [10:14:22] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 [10:14:30] PROBLEM - SSH on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:39] RECOVERY - pdfrender on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.151 second response time [10:14:42] mobrovac: load average: 6765.08, 3925.16, 1609.01 [10:14:49] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [10:14:49] RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.101 second response time [10:14:59] it seems so volans [10:15:00] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [10:15:09] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [10:15:19] RECOVERY - apertium apy on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.075 second response time [10:15:27] the host is responsive via ssh now, but it took a while to let me in [10:15:30] RECOVERY - SSH on scb2002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [10:15:30] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [10:15:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are 
healthy [10:15:49] RECOVERY - mathoid endpoints health on scb2002 is OK: All endpoints are healthy [10:15:58] OOM was in action [10:16:08] *oom killer [10:17:20] volans: can you take a look at kernel logs if you see something similar to T191199 please? [10:17:21] T191199: Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 [10:17:38] mobrovac: I was about to paste [10:17:38] nodejs: page allocation stalls for 10204ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) [10:17:42] so I guess so ;) [10:17:49] shite [10:18:04] this was at 10:14:24 [10:18:18] before that for 4 minutes there was nothing in syslog [10:21:32] mobrovac: only scb200[2,6] have those lines in current syslog (I grepped GFP_HIGHUSER_MOVABLE) [10:21:56] these are the only ones that failed [10:22:06] but 200[1,4] have it for yesterday's log (syslog.1) [10:22:41] I can continue the stat going backward if can be useful :) [10:23:04] there's obviously something going on here [10:23:16] * mobrovac states the obvious [10:23:47] :) [10:23:54] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10Volans) I've found the same logs in syslog for the affected hosts, so yes, definitely the same issue. 
[10:31:32] !log mobrovac@deploy1001 Started deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20 - T199458 T195390 [10:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:38] T199458: Increase number of results returned by /page/related to 20 - https://phabricator.wikimedia.org/T199458 [10:31:39] T195390: Trending shows same article twice - https://phabricator.wikimedia.org/T195390 [10:31:56] !log restarting archiva to pick up OpenJDK security update [10:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:22] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): Update core routers routing for labtest Cloud VPS deployment - https://phabricator.wikimedia.org/T199779 (10aborrero) p:05Triage>03Normal [10:40:22] eventstreams seems to be holding a disproportionally-high amount of mem on scb2006 given that afaik it's not used [10:40:42] half of the workers have 2% of mem each [10:43:36] !log mobrovac@deploy1001 Started deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #2 - T199458 T195390 [10:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:40] T199458: Increase number of results returned by /page/related to 20 - https://phabricator.wikimedia.org/T199458 [10:43:41] T195390: Trending shows same article twice - https://phabricator.wikimedia.org/T195390 [10:46:27] hmmm eventstreams in eqiad seems to be using less mem than in codfw [10:46:31] how is that possible? 
[10:54:37] 10Operations, 10monitoring, 10Patch-For-Review, 10User-herron: Reduce false positive icinga alerts during host reimages - https://phabricator.wikimedia.org/T195423 (10Volans) With the new approach of a delayed, in background, puppet run on the icinga server and icinga downtime most of this should be solved... [10:54:50] 10Operations, 10monitoring, 10User-herron: Reduce false positive icinga alerts during host reimages - https://phabricator.wikimedia.org/T195423 (10Volans) 05Open>03Resolved a:03Volans [10:57:00] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Epic: Relabel labcontrol1004.wikimedia.org as cloudcontrol1004.wikimedia.org - https://phabricator.wikimedia.org/T199782 (10aborrero) p:05Triage>03Normal [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1100). [11:00:04] Zoranzoki21 and kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:52] zeljkof: FYI mw2224 is being reimaged, it's depooled so should be out of scap targets, but in case there is any issue please let me know [11:00:54] * kart_ around [11:01:06] volans: ok, thanks, will do [11:01:07] I'll wait after the swat to repool it to avoid issues [11:01:17] cook, thanks [11:01:23] I can SWAT today! 
[11:02:05] kart_: I'll review and merge your commit first, since it takes a while to merge [11:02:16] Sure [11:02:20] while CI is running I'll start deploying Zoranzoki21's patches [11:03:19] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #2 - T199458 T195390 (duration: 19m 43s) [11:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:27] T199458: Increase number of results returned by /page/related to 20 - https://phabricator.wikimedia.org/T199458 [11:03:27] T195390: Trending shows same article twice - https://phabricator.wikimedia.org/T195390 [11:03:53] zeljkof: fyi, i'm in the middle of a RB deployment, but that is not related to any of the SWAT items, so we can go in parallel [11:04:15] (03PS12) 10Zoranzoki21: Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) [11:04:26] mobrovac: ok, thanks for letting me know [11:04:34] (03CR) 10Zoranzoki21: "@Zfilipin mwscript namespaceDupes.php --wiki=bnwikisource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) (owner: 10Zoranzoki21) [11:04:42] (03PS4) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [11:04:44] !log mobrovac@deploy1001 Started deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #3 [11:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:15] kart_: 446212 +2d, CI running, please stand by [11:05:37] (03CR) 10Zoranzoki21: [C: 04-1] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 
(https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:05:58] Zoranzoki21: reviewing your patches, starting with 445953? [11:06:11] Yes zeljkof [11:06:29] (03PS3) 10Zfilipin: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) (owner: 10Zoranzoki21) [11:07:08] (03PS5) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [11:07:53] (03CR) 10Zoranzoki21: "@Zfilipin mwscript namespaceDupes.php --wiki=thwiktionary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:09:27] santhosh I'm reading your comment at https://phabricator.wikimedia.org/T196219#4429871, do you think it's ok to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/445953 ? [11:10:36] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) (owner: 10Zoranzoki21) [11:12:13] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #3 (duration: 07m 29s) [11:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:19] (03Merged) 10jenkins-bot: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) (owner: 10Zoranzoki21) [11:12:35] (03CR) 10jenkins-bot: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445953 (https://phabricator.wikimedia.org/T196219) (owner: 10Zoranzoki21) [11:12:57] !log mobrovac@deploy1001 Started deploy [restbase/deploy@622941d]: Expose the 
data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #4 [11:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:02] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@622941d]: Expose the data/mobile/javascript end point, deduplicate most-read results and increase page/related response size to 20, take #4 (duration: 02m 05s) [11:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:26] Zoranzoki21: 445953 is at mwdebug1002 [11:15:38] zeljkof: Ok. Testing [11:17:40] (03PS3) 10Volans: Migrate the server side to Python3 [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/405879 [11:19:43] Zoranzoki21: ok to deploy? [11:19:58] Zoranzoki21: yes [11:20:10] ok, deploying [11:20:17] (03PS6) 10Zoranzoki21: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) [11:22:33] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:445953|Enable ULS webfonts by default at Burmese Wikipedia (mywiki) (T196219)]] (duration: 01m 50s) [11:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:37] T196219: Enable ULS webfonts by default at Burmese Wikipedia (mywiki) - https://phabricator.wikimedia.org/T196219 [11:23:19] Zoranzoki21: 445953 deployed, please check, continuing with 446005 [11:24:22] mywiki looks ok [11:24:28] You can continue with next [11:25:20] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:26:44] (03Merged) 10jenkins-bot: Create Thesaurus NS at thwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:28:05] (03CR) 10jenkins-bot: Create Thesaurus NS at thwikt [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:28:51] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: reimage/rename labcontrol1004 to cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/446289 (https://phabricator.wikimedia.org/T199781) [11:28:55] Zoranzoki21: 446005 is at mwdebug1002, please test [11:29:06] * Zoranzoki21 testing [11:29:23] (03PS13) 10Zfilipin: Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) (owner: 10Zoranzoki21) [11:31:44] (03PS1) 10Arturo Borrero Gonzalez: cloud vps: reimage/rename labcontrol1004 to cloudcontrol1004 Use the new naming scheme. Bug: T199781 Signed-off-by: Arturo Borrero Gonzalez [dns] - 10https://gerrit.wikimedia.org/r/446290 (https://phabricator.wikimedia.org/T199781) [11:33:24] (03PS1) 10Zoranzoki21: Fix I2b18f0ef8b095b6e3a8072a42bb37df663f5ab9d [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 [11:34:23] kart_: 446212 is merged, will ping you in a few minutes when it's at mwdebug1002 [11:34:30] (03PS2) 10Zoranzoki21: Fix I2b18f0ef8b095b6e3a8072a42bb37df663f5ab9d [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 [11:34:36] zeljkof: Ok! [11:34:52] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: reimage/rename labcontrol1004 to cloudcontrol1004 Use the new naming scheme. 
Bug: T199781 Signed-off-by: Arturo Borrero [dns] - 10https://gerrit.wikimedia.org/r/446290 (https://phabricator.wikimedia.org/T199781) (owner: 10Arturo Borrero Gonzalez) [11:34:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloud vps: reimage/rename labcontrol1004 to cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/446289 (https://phabricator.wikimedia.org/T199781) (owner: 10Arturo Borrero Gonzalez) [11:37:10] kart_: 446212 is at mwdebug1002 [11:38:00] kart_: sorry, it's not yet, will be there in a minute, my mistake [11:38:46] ah [11:38:53] I was wondering - why it is not working. [11:43:32] kart_: 446212 is at mwdebug1002, this time for real :) [11:43:52] zeljkof: testing.. [11:44:44] zeljkof: OK. Working fine. Go ahead. [11:44:51] kart_: ok, deploying [11:46:00] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/ContentTranslation: SWAT: [[gerrit:446212|Fix colors on CX1 (T199503)]] (duration: 00m 55s) [11:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:04] T199503: CX: Adjust background color in the translation editor view - https://phabricator.wikimedia.org/T199503 [11:46:34] kart_: deployed, please check and thanks for deploying with #releng ;) [11:47:12] cool. Thanks zeljkof [11:49:41] (03PS3) 10Zoranzoki21: Fix problem with namespaces at patch 446005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 (https://phabricator.wikimedia.org/T198585) [11:50:10] zeljkof: Production is fine. [11:51:38] kart_: \o/ [11:52:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:54:17] (03Merged) 10jenkins-bot: Fix problem with namespaces at patch 446005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:55:27] Zoranzoki21: 446291 is at mwdebug1002 [11:55:37] zeljkof: testing [11:57:33] looks good.. 
lgtd [11:57:45] ok, deploying [11:57:48] (03CR) 10jenkins-bot: Fix problem with namespaces at patch 446005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446291 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [11:58:51] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:446005|Create Thesaurus NS at thwikt (T198585)]] [[gerrit:446291|Fix problem with namespaces at patch 446005 (T198585)]] (duration: 00m 51s) [11:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:56] T198585: Request for new namespace(s) at thwikt - https://phabricator.wikimedia.org/T198585 [11:59:08] Zoranzoki21: deployed, running script [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1200) [12:02:32] (03CR) 10Zfilipin: "Script output at T198585#4430224" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446005 (https://phabricator.wikimedia.org/T198585) (owner: 10Zoranzoki21) [12:03:18] Zoranzoki21: scripts executed, please check and thanks for deploying with #releng :D [12:03:25] !log EU SWAT finished [12:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:27] everything is ok. Thanks zeljkof! :) [12:05:14] Zoranzoki21: \o/ [12:22:12] (03PS1) 10Zfilipin: Group0 to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446304 [12:23:03] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10BBlack) It's pretty unclear to me (perhaps I'm failing at reading!) exactly what needs to happen here at v... 
[12:25:07] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.13 and rebuild l10n cache [12:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:10] (03PS1) 10BBlack: wikimediafoundation.org: add CAA authorizing LE [dns] - 10https://gerrit.wikimedia.org/r/446306 (https://phabricator.wikimedia.org/T198922) [12:35:24] zeljkof: let me know when could be a good time to repool mw1224, I see EU SWAT has completed but there are other deployments :) [12:35:37] (03PS1) 10BBlack: Add foundation.wikimedia.org hostname [dns] - 10https://gerrit.wikimedia.org/r/446307 (https://phabricator.wikimedia.org/T188776) [12:35:59] s/mw1224/mw2224/ [12:36:00] volans: as far as I am concerned, now is a good time, swat is done, train starts in 30 minutes [12:36:18] I saw Started scap: testwiki to php-1.32.0-wmf.13 just above [12:36:35] so I was wondering if there was something else going on between swat and train [12:36:38] volans: ah, good point, forgot about that [12:37:08] volans: this is happening https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki [12:37:24] it takes an hour, so probably will not finish until train window [12:37:47] I'm running this `scap sync "testwiki to php-VERSION and rebuild l10n cache"` [12:38:05] (03PS1) 10Filippo Giunchedi: admin: print host/path on titlebar (filippo) [puppet] - 10https://gerrit.wikimedia.org/r/446308 [12:38:05] ack, then I can repool it later on, as long as it didn't make scap complain during swat ;) [12:38:17] no, all was good during swat [12:38:34] ok, thanks, then after the train it is [12:42:08] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/446275 (owner: 10Volans) [12:48:29] (03PS2) 10BBlack: wikimediafoundation.org: add CAA authorizing LE [dns] - 10https://gerrit.wikimedia.org/r/446306 (https://phabricator.wikimedia.org/T198922) [12:49:28] (03PS3) 10BBlack: wikimediafoundation.org: add CAA authorizing LE [dns] - 10https://gerrit.wikimedia.org/r/446306 (https://phabricator.wikimedia.org/T198922) [12:51:10] 10Operations, 10Traffic, 10Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (10BBlack) The key thing we're missing here on our end, by the actual transition date, is an IP address from Automattic to put in our DNS for this domain. [12:56:44] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [12:57:33] (03CR) 10Volans: "Compiler result available here:" [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [13:00:04] zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1300). 
[13:01:14] (03PS9) 10Volans: debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) [13:05:02] (03CR) 10Volans: [C: 032] debmonitor: set up maintenance [puppet] - 10https://gerrit.wikimedia.org/r/443371 (https://phabricator.wikimedia.org/T198526) (owner: 10Volans) [13:05:31] (03PS2) 10Volans: debmonitor: set proxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/446275 [13:06:23] !log Drop unused grants on db1073, db1117:3323, db1117:3325 [13:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:48] (03CR) 10Volans: [C: 032] debmonitor: set proxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/446275 (owner: 10Volans) [13:07:10] jouncebot: o/ [13:10:31] PROBLEM - debmonitor.wikimedia.org on debmonitor2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 500 Internal Server Error [13:10:39] I know I know... fixing [13:10:52] it's the passive one, the active is ok [13:13:26] (03PS1) 10Volans: debmonitor: fix permissions for config file [puppet] - 10https://gerrit.wikimedia.org/r/446311 [13:14:46] (03CR) 10Volans: [C: 032] debmonitor: fix permissions for config file [puppet] - 10https://gerrit.wikimedia.org/r/446311 (owner: 10Volans) [13:17:01] RECOVERY - debmonitor.wikimedia.org on debmonitor2001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.074 second response time [13:17:11] PROBLEM - Disk space on mwdebug2002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=73%) [13:18:20] zeljkof: is it you? 
^^^ [13:18:23] volans, herron: just got this from scap: `13:15:59 pull failed: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--exclude=*.swp', '--no-perms', '--exclude=**/.git', 'mw2255.codfw.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 11` [13:19:03] volans: uh oh, maybe, still figuring out what to delete [13:19:10] zeljkof: /srv/mediawiki is 28GB [13:19:35] it is complaining about / [13:19:47] zeljkof: Looks like there's some l10n caches that should've probably already been nuked [13:19:56] jynus: all same partition [13:19:59] 4.0 GiB [##########] /php-1.32.0-wmf.12 [13:19:59] 4.0 GiB [######### ] /php-1.32.0-wmf.10 [13:19:59] 4.0 GiB [######### ] /php-1.32.0-wmf.999 [13:20:00] 4.0 GiB [######### ] /php-1.32.0-wmf.8 [13:20:01] 4.0 GiB [######### ] /php-1.32.0-wmf.7 [13:20:03] 4.0 GiB [######### ] /php-1.32.0-wmf.6 [13:20:05] 637.7 MiB [# ] /php-1.32.0-wmf.13 [13:20:09] uh oh, ok, running scap to delete stuff [13:20:16] 6, 7 and 8 should be good to go at least [13:20:23] ok, volans, I thought it was separated [13:20:33] I wished it too :) [13:21:51] zeljkof: regarding mw2255 what's the logged error? exit 11 from rsync is a generic I/O [13:21:57] mwdebug2001 and 2002 are full [13:22:04] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-instance=All [13:22:11] Reedy: ok, deleting 6-8 [13:22:58] zeljkof: Just the l10n caches, right? [13:23:05] and/or the non static asset stuff? [13:23:34] Reedy: well, I was planning on running this `scap clean --delete VERSION` [13:23:46] as documented here https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Clean_up_old_stuff [13:24:10] Ok.. 
[13:24:11] So [13:24:26] You can scap clean (no delete) .10 too [13:24:41] Just check the 30 day thing before running with --delete [13:24:45] ok --delete 6-8, just clean for 10 [13:25:06] Reedy: what do you mean by "check 30 day"? [13:25:16] For all branches more than 30 days old, drop everything. MediaWiki_1.32/Roadmap is a good place to find when a branch was created. [13:25:16] Find old branches: [13:25:16] you@deploy1001:/srv/mediawiki-staging/$ find . -maxdepth 1 -type d -name 'php-1.32.0-wmf.*' -print [13:25:17] to make sure it's older than 30 days? [13:25:24] Yeah [13:25:30] You can definitely scap clean 6-8 [13:25:40] !log pruned obsolete hosts from debmonitor [13:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:43] I'm fuzzy on when they were deployed, so hence checking before running --delete [13:25:44] ok, starting with 6-8 [13:25:59] that's why I added link to https://www.mediawiki.org/wiki/MediaWiki_1.32/Roadmap [13:26:05] not sure if there is a better place to look [13:27:25] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.13 and rebuild l10n cache (duration: 62m 17s) [13:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:47] ok, so this does not work for me :/ [13:27:52] scap clean --delete 1.32.0-wmf.6 [13:28:03] 13:27:17 clean failed: Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "zfilipin"; reason is "testwiki to php-1.32.0-wmf.13 and rebuild l10n cache" [13:28:13] Because you were still scapping? [13:28:18] Try again now the other job has finished [13:28:26] hm, I thought it failed... 
[13:28:28] let me check [13:28:44] oh, just finished, trying again [13:30:04] volans: just saw this [13:30:22] 13:27:23 sudo -u mwdeploy -n -- /usr/bin/rsync -l deploy1001.eqiad.wmnet::common/wikiversions*.{json,php} /srv/mediawiki on mwdebug2002.codfw.wmnet returned [11]: rsync: write failed on "/srv/mediawiki/wikiversions.json": No space left on device (28) [13:30:24] rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2] [13:30:43] that's expected, mwdebug2002 had no space left [13:31:28] It should be ok when the scap clean stuff runs [13:31:53] deleting .6 [13:32:01] will delete 6-8 [13:32:51] zeljkof: For future, probably do this before you start pushing a new version out :) [13:33:06] Reedy: that's what the docs say too :/ [13:33:41] my second train, last time thcipriani|afk did the cleanup so I did not take it seriously enough this time, my mistake :( [13:34:07] Well, hopefully lesson learned [13:34:11] And no major damage done :) [13:34:16] hopefully :D [13:34:37] and I _am_ glad to hear I did not break anything serious :| [13:35:13] !log zfilipin@deploy1001 Pruned MediaWiki: 1.32.0-wmf.6 (duration: 06m 18s) [13:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:30] RECOVERY - Disk space on mwdebug2002 is OK: DISK OK [13:35:31] ok, .6 cleanup done, but I am still getting some errors... [13:35:43] 13:35:09 pull failed: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--exclude=*.swp', '--no-perms', '--exclude=**/.git', 'mw2255.codfw.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 11 [13:36:07] some more details zeljkof ? 
[13:36:12] like output :D [13:36:22] volans: I'll create phab paste [13:36:23] It's not that host [13:36:24] /dev/md1 944G 38G 859G 5% / [13:36:35] I guess it's something pulling from that host as a scap proxy [13:36:47] zeljkof: thans [13:36:50] *thanks [13:37:29] volans: https://phabricator.wikimedia.org/P7369 [13:38:26] zeljkof: see lines 260-262 [13:38:35] it was still mwdebug2002 [13:39:01] Looks alright now [13:39:14] should I re-run the delete command for .6? [13:39:19] or just run it for .7? [13:39:46] it can happen that a server gets too stuffed up for rsync to work [13:40:01] so, `scap clean --delete 1.32.0-wmf.6` or `scap clean --delete 1.32.0-wmf.7`? [13:40:03] something else? [13:40:14] Try 6 again... At worse it'll go lolno [13:40:16] thcipriani|afk: argh, probably my fauld [13:40:20] fault [13:40:20] in which case you can manually remove what you removed on the deployment server on that host [13:40:34] Reedy: :D [13:40:53] thcipriani|afk: um, I don't think I did anything there? [13:40:59] as you can see on mwdebug2002 /srv/mediawiki/php-1.32.0-wmf.6 is still there [13:41:02] I mean, at mwdebug2002 [13:41:31] thcipriani|afk: so, `scap clean --delete 1.32.0-wmf.6` again? [13:41:32] yeah, rsync needs to create tmp files to sync, if there's no space it just bails [13:41:52] zeljkof: probably login to mwdebug2002 and rm -rf /srv/mediawiki/php-1.32.0-wmf.6 [13:41:53] or ssh to the machine and delete stuff there? [13:42:02] thcipriani|afk: ok, let me try that [13:42:05] * zeljkof sweats [13:42:09] and then the clean for wmf.7 should work from deploy1001 [13:42:13] .6 is only 635MB on that host, not 4GB [13:42:50] oh right, my scap clean change last week [13:43:09] goes through and removes the l10n cache after attempting rsync [13:43:26] haha [13:43:27] for the version specified [13:43:44] thcipriani|afk: still ssh to 2002? or something else? 
[13:44:55] zeljkof: probably a scap pull on mwdebug2002 should be all you need [13:45:02] thcipriani|afk: ok, I cant rm stuff from mwdebug2002, I'm getting Permission denied [13:45:13] scap clean will bail if it can't find the version you've specified on disk [13:45:13] ok, doing scap pull [13:51:53] 10Operations, 10Wikimedia-Logstash, 10Goal: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) An easy target for producers that are not in logstash but should is central syslog and the programs that log t... [13:53:53] thcipriani|afk: `zfilipin@mwdebug2002:~$ scap pull` is taking forever... [13:54:00] am I doing something wrong? [13:54:13] or is it supposed to be that slow? [13:54:26] !log rebooting mw1261 with SSBD-enabled microcode [13:54:26] It'll probably take a while [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:39] .6 has been deleted at least [13:55:09] zeljkof: since sync failed initially, cdb rebuild didn't happen for the new version, so that's probably happening now [13:55:18] is my guess [13:55:44] ok, just checking [13:55:51] and thanks [13:57:14] thcipriani|afk, Reedy: ok, scap pull is done [13:57:25] so should I re-run delete .6, just in case? [13:57:30] or should I go to .7? [13:57:53] Shouldn't need to re-run 6... If you cleaned up the only host that was erroring [13:58:15] Reedy: ok, thanks, then running `scap clean --delete 1.32.0-wmf.7` [13:58:16] +1 cleanup of wmf.7 [13:58:22] Yeah [13:58:46] thcipriani|afk: Any idea if .999 is still needed? [13:59:33] Cause it should at least have a clean (no delete) [13:59:51] Reedy: trying to remember what week that was cut. If it's older than a month it can be deleted. If not, yeah, clean would be fine. 
[14:00:12] drwxr-xr-x 15 mwdeploy mwdeploy 4096 Jun 14 19:43 php-1.32.0-wmf.999 [14:00:17] Doesn't look to have been touched in a month [14:00:41] https://tools.wmflabs.org/sal/log/AWQARORKoDEJc1hAFQmQ [14:00:47] yep, --delete [14:01:28] (03PS1) 10Filippo Giunchedi: wdqs: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446318 (https://phabricator.wikimedia.org/T198756) [14:01:35] thcipriani, Reedy ok, so 6, 7, 8 and 999 are `scap clean --delete 1.32.0-wmf.VERSION` [14:01:57] (03CR) 10Filippo Giunchedi: [C: 032] admin: print host/path on titlebar (filippo) [puppet] - 10https://gerrit.wikimedia.org/r/446308 (owner: 10Filippo Giunchedi) [14:02:04] (03PS2) 10Filippo Giunchedi: admin: print host/path on titlebar (filippo) [puppet] - 10https://gerrit.wikimedia.org/r/446308 [14:02:09] zeljkof: yep [14:02:15] ok, will do [14:02:59] thcipriani: do I need to re-run `scap sync "testwiki to php-VERSION and rebuild l10n cache"` when done with the cleanup? [14:03:13] It's probably a good idea to ensure consistency [14:03:16] because there were errors the first time [14:03:26] Should be relatively quick to run a second time though [14:03:31] ok, will do as soon as I delete all software :) [14:03:47] or is it destroy all software? [14:03:59] !log zfilipin@deploy1001 Pruned MediaWiki: 1.32.0-wmf.7 (duration: 03m 39s) [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:29] /nick zeljkof|murderer [14:04:40] PROBLEM - Host snapshot1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:04:49] apergos: ^ [14:05:00] :D [14:05:06] yes it still is [14:05:08] Reedy: that's probably maintenance by Chris [14:05:15] ah, ok [14:05:19] host is down for quite a while now [14:05:20] didn't see anything in the SAL [14:05:33] moritzm: the system board is being replaced [14:05:37] forgot about mgmt alerts [14:05:38] mute expired? 
[14:05:42] heh [14:07:37] !log zfilipin@deploy1001 Pruned MediaWiki: 1.32.0-wmf.8 (duration: 03m 11s) [14:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:33] (03PS1) 10Volans: wmf-auto-reimage: remove host from Debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/446323 [14:12:39] !log rebooting wtp1025 with SSBD-enabled microcode [14:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:59] !log zfilipin@deploy1001 Pruned MediaWiki: 1.32.0-wmf.999 (duration: 03m 01s) [14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:04] (03PS2) 10Filippo Giunchedi: wdqs: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446318 (https://phabricator.wikimedia.org/T198756) [14:19:06] (03PS1) 10Filippo Giunchedi: dumps/snapshot: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446324 (https://phabricator.wikimedia.org/T198756) [14:19:08] (03PS1) 10Filippo Giunchedi: mediawiki: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446325 (https://phabricator.wikimedia.org/T198756) [14:19:17] ok, wmf.6-8 and 999 deleted [14:19:40] (03CR) 10Muehlenhoff: [C: 031] "Seems fine, but one comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446323 (owner: 10Volans) [14:19:55] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.13 and rebuild l10n cache [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:13] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446323 (owner: 10Volans) [14:24:11] (03CR) 10Muehlenhoff: [C: 031] wmf-auto-reimage: remove host from Debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446323 (owner: 10Volans) [14:26:09] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.13 and rebuild l10n cache (duration: 06m 14s) 
[14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:26] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) Launch date for new site is 30 July 2018, so changes to old site's URL should happen at same time... [14:27:00] no problems this time with `scap sync "testwiki to php-1.32.0-wmf.13 and rebuild l10n cache"` [14:27:08] \o/ [14:28:20] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2061 - https://phabricator.wikimedia.org/T199759 (10Papaul) a:05Papaul>03Marostegui disk replaced [14:28:54] zeljkof: let me know when all is finished and might be a good time for the repool of mw2224 ;) [14:29:12] volans: will do [14:30:18] thx [14:30:38] (03CR) 10Ayounsi: [C: 031] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/446235 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [14:31:25] thcipriani: so, about https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Clean_up_old_stuff [14:32:10] I've created .13, existing branches are .10, .12 and .13 [14:32:15] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10BBlack) I don't think redirects are really an acceptable solution for the policy pages, as we'd still be se... [14:32:29] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) @RobH we can not do disks wipe on this system. The system can't boot and we do not have any identical server not in use to put the disk in and do the... 
[14:32:55] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10awight) @daniel Surprisingly, there is interest in this going through TechCom after all. I've been digesting... [14:34:08] thcipriani: active branches are .12 and .13, and the prior one is .10, so I don't need to run `scap clean VERSION` for anything else, right? [14:37:31] RECOVERY - Host snapshot1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [14:38:37] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Papaul) a:05Papaul>03MoritzMuehlenhoff @MoritzMuehlenhoff assigning you this task if you have time to look at it while I am gone for vacation. Thanks [14:40:06] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Reedy) Is it definitely going to foundation.wikimedia.org? If so, I can make the patches for below >>! In... [14:41:11] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) Yes - the links should be updated. We were asked to setup the redirects to handle old links and an... 
[14:43:21] (03CR) 10Zfilipin: [C: 032] Group0 to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446304 (owner: 10Zfilipin) [14:43:30] RECOVERY - Host snapshot1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:44:18] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Reedy) Oh, and someone needs to update and deploy the portal links too. Probably need to ping discovery for... [14:44:31] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) To be clear, foundation.wikimedia.org was the URL we were asked several months ago to utilize - it... [14:44:54] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446304 (owner: 10Zfilipin) [14:47:20] PROBLEM - mediawiki-installation DSH group on mw2224 is CRITICAL: Host mw2224 is not in mediawiki-installation dsh group [14:47:32] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.13 [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446304 (owner: 10Zfilipin) [14:48:40] that's me, downtime expired [14:49:39] expanded [14:50:41] !log rebooting ms-fe1005 with SSBD-enabled microcode [14:50:43] volans: ok, I'm done [14:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:04] zeljkof: great, thanks, running scap pull and repooling [14:51:07] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 
(10BBlack) So, trying to sum a bit of process on the above as I see it: 1. I need to merge https://gerrit.wik... [14:55:36] (03CR) 10BBlack: [C: 032] Add foundation.wikimedia.org hostname [dns] - 10https://gerrit.wikimedia.org/r/446307 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [14:56:58] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10RobH) We don't need an identical system, just any system we can install the disks into. I advise using a spare system to do this. Make sense? [14:57:35] (03PS1) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T199808) [14:58:15] (03PS2) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T199808) [14:58:52] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Marostegui) Or servers that we already decommissioned? [14:58:56] !log repooled mw2224 [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:46] (03CR) 10jerkins-bot: [V: 04-1] Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T199808) (owner: 10Reedy) [15:02:20] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) We have no spare system that can take 12 disks I will just use one of the Dell decommissioned server. [15:04:20] (03CR) 10ArielGlenn: [C: 031] "Sure! Thanks for catching this." 
[puppet] - 10https://gerrit.wikimedia.org/r/446324 (https://phabricator.wikimedia.org/T198756) (owner: 10Filippo Giunchedi) [15:04:37] PROBLEM - mediawiki-installation DSH group on snapshot1005 is CRITICAL: Host snapshot1005 is not in mediawiki-installation dsh group [15:04:50] (03PS1) 10BBlack: cache_text: redirect old foundation wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/446334 (https://phabricator.wikimedia.org/T188776) [15:06:25] (03PS3) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T199808) [15:08:00] (03CR) 10jerkins-bot: [V: 04-1] Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T199808) (owner: 10Reedy) [15:08:46] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): Update core routers routing for labtest Cloud VPS deployment - https://phabricator.wikimedia.org/T199779 (10ayounsi) 1/ Other then the typos, it looks good to me. 10.192.16.0/22 -> 10.19**6**.16.0/2**1** 2/ From the Cloud team, only Chas... [15:09:10] (03CR) 10Reedy: cache_text: redirect old foundation wiki URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446334 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [15:09:45] (03PS4) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) [15:09:58] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 4 minutes ago with 10 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[cpjobqueue/deploy],Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy] [15:11:21] (03CR) 10jerkins-bot: [V: 04-1] Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [15:12:06] (03PS1) 10Volans: Fix socket message for Python3 [software/conftool] - 10https://gerrit.wikimedia.org/r/446336 [15:12:13] _joe_: ^^^ [15:13:50] (03PS1) 10BBlack: apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 [15:14:20] (03PS2) 10BBlack: apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 (https://phabricator.wikimedia.org/T188776) [15:15:06] (03PS1) 10Reedy: Update redirects to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/446338 (https://phabricator.wikimedia.org/T188776) [15:18:09] (03PS1) 10BBlack: Add foundation.m.wikimedia.org hostname [dns] - 10https://gerrit.wikimedia.org/r/446340 (https://phabricator.wikimedia.org/T188776) [15:18:28] (03CR) 10BBlack: [C: 032] Add foundation.m.wikimedia.org hostname [dns] - 10https://gerrit.wikimedia.org/r/446340 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [15:18:37] !log volans@sarin conftool action : set/pooled=yes; selector: name=mw2224.codfw.wmnet [15:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:11] _joe_: quick temporary live hack to test that it works ;) ^^^ [15:20:38] (03CR) 10Filippo Giunchedi: [C: 032] dumps/snapshot: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446324 (https://phabricator.wikimedia.org/T198756) (owner: 10Filippo Giunchedi) [15:21:00] (03PS5) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) [15:21:16] (03PS2) 10Filippo Giunchedi: 
dumps/snapshot: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446324 (https://phabricator.wikimedia.org/T198756) [15:21:27] 10Operations, 10Traffic, 10Patch-For-Review: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (10herron) p:05Triage>03Normal [15:22:53] (03PS1) 10Volans: Release 1.0.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/446343 [15:24:48] RECOVERY - Device not healthy -SMART- on db2061 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops [15:25:32] (03PS3) 10BBlack: apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 (https://phabricator.wikimedia.org/T188776) [15:28:08] 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339 (10Papaul) a:05Papaul>03RobH @RobH here is the audit you requested. |Racks|PDU|X|Y|Y| |A1|PS1|4.9|2.8|4.8| |A1|PS2|4.3|2.4|4.7| |A2|PS1|6.5|1.4|6.5... [15:28:35] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): Update core routers routing for labtest Cloud VPS deployment - https://phabricator.wikimedia.org/T199779 (10ayounsi) 05Open>03Resolved After a chat on IRC, change pushed to cr1/2-codfw: ```lang=diff [edit routing-options static route... 
[15:28:37] 10Operations, 10Core-Platform-Team, 10monitoring, 10Wikimedia-Incident: Add alerts for Logstash rates in production - https://phabricator.wikimedia.org/T199479 (10herron) p:05Triage>03Normal [15:31:28] (03PS2) 10Volans: wmf-auto-reimage: remove host from Debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/446323 [15:31:30] (03PS1) 10Volans: wmf-auto-reimage: use system URLs for Apache test [puppet] - 10https://gerrit.wikimedia.org/r/446346 [15:31:47] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received [15:32:41] 10Operations, 10Wikimedia-Mailing-lists: Request for languageconverter@lists.wikimedia.org - https://phabricator.wikimedia.org/T198887 (10herron) a:03herron [15:32:48] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [15:32:48] gehel, bblack ^^^ [15:32:56] ah, seems transiet [15:33:00] *transient? [15:33:34] strange [15:35:09] (03CR) 10Volans: [C: 032] wmf-auto-reimage: remove host from Debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/446323 (owner: 10Volans) [15:35:28] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:35:33] (03PS2) 10BBlack: cache_text: redirect old foundation wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/446334 (https://phabricator.wikimedia.org/T188776) [15:35:39] 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339 (10RobH) @papaul: Anywhere that the phases are more than 4% out of balance from the lowest phase, they need to be rebalanced. This is something that you... 
[15:35:54] jouncebot: now [15:35:54] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [15:35:56] jouncebot: next [15:35:57] In 0 hour(s) and 24 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1600) [15:40:27] !log rebooting ms-fe1006 with SSBD-enabled microcode [15:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:37] !log switching default analytics-in6 term to reject+log - T198623 [15:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:41] T198623: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 [15:44:47] PROBLEM - apertium apy on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/content-html/{title}{/revision}{ [15:44:57] tent HTML for test page) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [15:44:57] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received [15:44:58] PROBLEM - dhclient process on scb2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:45:07] PROBLEM - pdfrender on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:08] PROBLEM - eventstreams on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:30] (03CR) 10Reedy: [C: 031] apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [15:45:48] RECOVERY - apertium apy on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time [15:45:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [15:45:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:45:59] RECOVERY - dhclient process on scb2003 is OK: PROCS OK: 0 processes with command name dhclient [15:46:08] RECOVERY - pdfrender on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [15:46:12] there we go again [15:46:12] 10Operations, 10Wikimedia-Mailing-lists: Request for languageconverter@lists.wikimedia.org - https://phabricator.wikimedia.org/T198887 (10herron) 05Open>03Resolved Hello, this list has been created with open subscription and public archive enabled. The web admin interface is located at https://lists.wikim... 
[15:46:17] RECOVERY - eventstreams on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.103 second response time [15:46:19] this is "known" ^ [15:47:27] RECOVERY - mediawiki-installation DSH group on mw2224 is OK: OK [15:49:49] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.12/includes/libs/objectcache/BagOStuff.php: 2d919adcbf9120098ff81a49c94dea5d7127fe64 (duration: 00m 56s) [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:15] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.12/tests/phpunit/includes/libs/objectcache/BagOStuffTest.php: (no justification provided) (duration: 00m 54s) [15:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) Note that before switching to a default reject+log, I added terms to permit traffic to text-lb, misc-lb, and lists on port 443... [15:57:58] 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339 (10Papaul) @RobH Better to do this when I am back from vacation. Thanks. [16:00:05] godog, moritzm, and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:01] 10Operations, 10ops-codfw, 10Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339 (10RobH) >>! In T163339#4430979, @Papaul wrote: > @RobH > Better to do this when I am back from vacation. Thanks. Absolutely, no reason to go mucking... 
[16:01:27] 10Operations, 10ops-codfw, 10Patch-For-Review: codfw pdu phase inbalances: audit and correct - https://phabricator.wikimedia.org/T163339 (10RobH) [16:07:53] 10Operations, 10Analytics, 10EventBus, 10SCB, 10Services (blocked): EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10mobrovac) p:05Triage>03Unbreak! [16:14:29] (03CR) 10Nuria: [C: 031] turnilo: fix indentation of config file [puppet] - 10https://gerrit.wikimedia.org/r/446238 (owner: 10Elukey) [16:16:57] 10Operations, 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10mobrovac) It has happened yet again today on `scb2003`. So far it looks like EventStreams is swallowing memory, cf. {T199813}. [16:19:37] RECOVERY - HP RAID on db2061 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [16:24:28] 10Operations, 10Analytics, 10EventBus, 10Wikimedia-Stream, and 2 others: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10mobrovac) [16:26:47] 10Operations: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10faidon) p:05Triage>03Normal [16:27:13] 10Operations, 10monitoring, 10Privacy, 10Security-Core: status.wikimedia.org should not load Google Analytics - https://phabricator.wikimedia.org/T115945 (10faidon) I filed T199816 for removing that page, we can follow up on that and if implemented, resolve this task and its parent. 
[16:28:00] (03PS1) 10Faidon Liambotis: Remove status.wikimedia.org monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/446358 (https://phabricator.wikimedia.org/T199816) [16:28:22] (03PS1) 10Faidon Liambotis: Remove status.wikimedia.org A/AAAA [dns] - 10https://gerrit.wikimedia.org/r/446359 (https://phabricator.wikimedia.org/T199816) [16:28:53] 10Operations, 10monitoring, 10Patch-For-Review: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10faidon) [16:32:05] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2061 - https://phabricator.wikimedia.org/T199759 (10Marostegui) 05Open>03Resolved All good - thank you! ``` root@db2061:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F3720) Port Name: 1I Port... [16:38:16] jouncebot: next [16:38:17] In 0 hour(s) and 21 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1700) [16:46:14] (03PS1) 10Reedy: Make foundationwiki readonly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446365 [16:48:09] (03CR) 10BBlack: [C: 031] Make foundationwiki readonly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446365 (owner: 10Reedy) [16:48:45] (03PS1) 10Ladsgroup: labs: Add change_tag_def to labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/446366 (https://phabricator.wikimedia.org/T185355) [16:49:07] (03PS1) 10Aklapper: phabricator: Print IDs of projects of tasks assigned to disabled accounts [puppet] - 10https://gerrit.wikimedia.org/r/446367 [16:49:24] !log installing faad2 security updates [16:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:53] (03CR) 10Reedy: [C: 032] Make foundationwiki readonly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446365 (owner: 10Reedy) [16:50:49] moritzm: _joe_ godog: can this be merged as part of puppet SWAT? 
https://gerrit.wikimedia.org/r/446366 [16:50:56] That would be great, it's very simple [16:51:39] (03Merged) 10jenkins-bot: Make foundationwiki readonly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446365 (owner: 10Reedy) [16:51:55] (03CR) 10jenkins-bot: Make foundationwiki readonly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446365 (owner: 10Reedy) [16:51:58] Amir1: I can merge that [16:52:09] That'd be great [16:52:12] but it will require someone from cloud to run it [16:52:20] so not sure if really useful [16:52:45] bstorm_ maybe? [16:52:51] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Make foundationwiki readonly (duration: 00m 54s) [16:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:53] jynus: I don't who can help tbh [16:53:29] so what I mean is that I don't have any problem with merging it, but as far as I know, new tables will not appear automatically [16:53:40] so let me know if you still want me to do it [16:54:30] I can run it after it's merged [16:54:42] bstorm_: Great, thank you! [16:54:50] (03CR) 10Jcrespo: [C: 032] labs: Add change_tag_def to labs replicas [puppet] - 10https://gerrit.wikimedia.org/r/446366 (https://phabricator.wikimedia.org/T185355) (owner: 10Ladsgroup) [16:54:50] jynus: ^ [16:54:58] You're fast :) [16:55:29] done [16:56:06] bstorm_: also, if possible please run it on change_tag table as well, the new column has not been replicated last time I checked [16:56:53] you may need a local puppet run, whereve that runs (I don't know) [16:57:01] I'll check in a bit. In a meeting now. [16:57:11] So it'll be a few. [16:57:20] But only a few minutes :) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1700). 
[17:00:26] no parsoid deploy [17:00:37] !log disabling puppet on most of mw* for foundationwiki apache config deploy - T188776 [17:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:42] T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 [17:02:09] jynus: hi! Were those helpful, extra notes at the end of the minutes yours? If so, I've responded... [17:02:38] (03CR) 10BBlack: [C: 032] apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [17:02:38] (03PS4) 10BBlack: apache: foundationwiki hostname transition [puppet] - 10https://gerrit.wikimedia.org/r/446337 (https://phabricator.wikimedia.org/T188776) [17:03:04] Amir1: ok, looking into that run on the replicas now [17:04:10] thanks! [17:05:41] (03PS6) 10Reedy: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) [17:05:43] (03CR) 10Reedy: [C: 032] Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:06:59] (03Merged) 10jenkins-bot: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:08:17] (03CR) 10jenkins-bot: Move foundationwiki domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446333 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:12:08] Amir1: change_tag seems to be there, at least on the DB I just checked. I'll get the new one out. 
[17:12:51] bstorm_: the table is there but the new column added (ct_tag_id) is not replicated [17:13:07] Ahh [17:13:36] I can refresh the view [17:14:09] bstorm_: the schema change for that is done for all except master of s1 (due to failover tomorrow) so you need to do later for enwiki as well [17:14:21] the new table is everywhere though [17:14:26] ok [17:15:11] If you can make a task for that last bit, I won't forget and can coordinate it (or link me to the current related task for the schema change, and I can link a task to it) [17:15:15] Thank you and sorry for the trouble [17:15:36] sure [17:15:40] (03PS1) 10BBlack: apache config for foundation wiki new hostname [puppet] - 10https://gerrit.wikimedia.org/r/446373 (https://phabricator.wikimedia.org/T188776) [17:15:56] (03CR) 10Reedy: [C: 031] apache config for foundation wiki new hostname [puppet] - 10https://gerrit.wikimedia.org/r/446373 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [17:16:32] (03CR) 10BBlack: [C: 032] apache config for foundation wiki new hostname [puppet] - 10https://gerrit.wikimedia.org/r/446373 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [17:19:40] bstorm_: https://phabricator.wikimedia.org/T199818 [17:19:46] Thanks [17:19:51] I shamelessly assigned it to you [17:20:08] I see that lol [17:23:09] The new table should be visible on the replicas. 
[17:24:11] I refreshed change_tag in its current state as well [17:25:35] That looks fantastic [17:25:44] (see it in frwiki) [17:28:23] (03CR) 10Smalyshev: [C: 031] wdqs: use syslogidentifier in systemd units [puppet] - 10https://gerrit.wikimedia.org/r/446318 (https://phabricator.wikimedia.org/T198756) (owner: 10Filippo Giunchedi) [17:37:23] (03PS1) 10Reedy: Move foundationwiki to priority 8 before *.wikimedia.org catch all [puppet] - 10https://gerrit.wikimedia.org/r/446375 (https://phabricator.wikimedia.org/T188776) [17:38:05] (03CR) 10BBlack: [C: 031] Move foundationwiki to priority 8 before *.wikimedia.org catch all [puppet] - 10https://gerrit.wikimedia.org/r/446375 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:39:29] (03PS2) 10Reedy: Move foundationwiki to priority 8 before *.wikimedia.org catch all [puppet] - 10https://gerrit.wikimedia.org/r/446375 (https://phabricator.wikimedia.org/T188776) [17:42:49] (03CR) 10BBlack: [C: 031] Move foundationwiki to priority 8 before *.wikimedia.org catch all [puppet] - 10https://gerrit.wikimedia.org/r/446375 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:43:18] (03CR) 10BBlack: [C: 032] Move foundationwiki to priority 8 before *.wikimedia.org catch all [puppet] - 10https://gerrit.wikimedia.org/r/446375 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [17:50:41] Hi, a recent patch to ConfirmEdit broke Flow and that hit production with wmf.13 (See T199811) I have a fix for it that I would like to SWAT soon if someone can sanity-check it. https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ConfirmEdit/+/446344/ [17:50:41] T199811: Flow completely stopped working for my user. - https://phabricator.wikimedia.org/T199811 [17:52:44] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: cleanup labcontrol1004 [dns] - 10https://gerrit.wikimedia.org/r/446376 (https://phabricator.wikimedia.org/T199781) [17:54:49] stephanebisson: Just .14? 
[17:54:50] *.13 [17:55:06] Reedy: yes (thanks for +2) [17:55:43] !log re-enabling puppet on mw* for foundationwiki apache config change T188776 [17:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:47] T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 [17:56:38] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: cleanup labcontrol1004 [dns] - 10https://gerrit.wikimedia.org/r/446376 (https://phabricator.wikimedia.org/T199781) (owner: 10Arturo Borrero Gonzalez) [17:56:48] jouncebot: next [17:56:48] In 1 hour(s) and 3 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1900) [17:56:52] Reedy: do you have the power to swat it out early or should I sign up for evening swat? [17:57:10] That's what I'm doing as I have the dirty deploy dir :) [17:58:12] what "MediaWiki train - Americas version"? [17:58:20] *what's [17:58:24] paladox: They made two different windows [17:58:30] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Epic: Relabel labcontrol1004.wikimedia.org as cloudcontrol1004.wikimedia.org - https://phabricator.wikimedia.org/T199782 (10aborrero) [17:58:30] ah ok, thanks. [17:58:35] An earlier one if someone in the EU is deploying. 
A later one for US people [18:04:30] (03PS2) 10BBlack: Update redirects to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/446338 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [18:08:21] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10aborrero) [18:11:06] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10RobH) a:03Papaul @papaul: When you return from vacation, please connect eth1 on both labtestnet200[23] and post the ports on this task. Then we ca... [18:11:38] PROBLEM - puppet last run on scb2005 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 5 minutes ago with 10 failures. Failed resources (up to 3 shown): Exec[absent_ensure_members],Exec[ops_ensure_members],Exec[wikidev_ensure_members],Exec[adm_ensure_members] [18:17:00] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.13/extensions/ConfirmEdit: Make a function public again (duration: 00m 56s) [18:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:07] stephanebisson: done [18:17:51] Reedy: Thanks a lot! I just tested and it's fixed. 
[18:17:56] Sweet [18:24:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: snapshot1005 does not power back up - https://phabricator.wikimedia.org/T198792 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The system board has been replaced and the server is accessible now [18:28:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:28:36] !log reedy@deploy1001 Synchronized errorpages/: update foundationwiki urls (duration: 00m 54s) [18:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:56] !log reedy@deploy1001 Synchronized docroot/: update foundationwiki urls (duration: 00m 54s) [18:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:21] !log reedy@deploy1001 Synchronized tests/: foundationwiki (duration: 00m 53s) [18:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:43] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: foundationwiki (duration: 00m 53s) [18:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:38] !log rolling restart eventstreams on scb2* nodes to avoid OOMs during the EU night [18:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:50] !log reedy@deploy1001 Synchronized wmf-config/: foundationwiki (duration: 00m 54s) [18:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:10] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446385 [18:36:12] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446385 (owner: 10Reedy) [18:36:42] !log disabling puppet on mw* 
for more foundationwiki work - T188776 [18:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:46] T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 [18:37:08] (03CR) 10Jcrespo: [C: 031] mariadb: Promote db1067 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [18:37:18] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:37:33] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446385 (owner: 10Reedy) [18:37:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:37:46] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Promote db1067 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445371 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [18:37:50] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446385 (owner: 10Reedy) [18:38:33] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 02m 53s) [18:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:00] !log ban elastic1030 from eqiad search cluster for latency issues. 
Likely related to T156137 [18:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:04] T156137: Reduce impact of GC pauses on elasticsearch response time - https://phabricator.wikimedia.org/T156137 [18:39:34] (03CR) 10BBlack: [C: 032] Update redirects to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/446338 (https://phabricator.wikimedia.org/T188776) (owner: 10Reedy) [18:39:38] (03CR) 10Imarlier: [C: 031] "> @Imarlier: Hm.., rename it how? I haven't changed the name of the" [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [18:41:25] 10Operations, 10Goal: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083 (10herron) p:05Triage>03Normal [18:41:40] 10Operations, 10Traffic, 10Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10herron) p:05Triage>03Normal [18:42:17] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:46:48] !log re-enabling puppet on mw* for more foundationwiki work - T188776 [18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:52] T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 [18:48:02] (03PS3) 10BBlack: cache_text: redirect old foundation wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/446334 (https://phabricator.wikimedia.org/T188776) [18:48:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [18:49:22] (03CR) 10BBlack: [C: 032] cache_text: redirect old 
foundation wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/446334 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [18:51:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:57:02] strange, that router interface on cr2-eqiad had scheduled maintenance performed on it earlier today. zayo sent the all clear about 8 hours ago [19:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1900) [19:04:24] Hauskatze: Poke [19:04:43] Yes sir [19:04:53] How are we supposed to update these portal templates on meta? [19:05:10] They ain't maintained through Meta anymore afaik [19:05:20] I think there's a separate repo somewhere, IIRC [19:05:21] Only the sister projects ones [19:05:25] a git repo I mean [19:05:35] bblack: Yeah, I've done that one [19:05:39] The problem is some of them are still onwiki [19:05:57] Hauskatze: So I'm fine just editing them on meta? [19:05:57] There's the [[Www.wikibooks.org template]] et al. [19:06:07] Yeah, see https://meta.wikimedia.org/wiki/Special:Contributions/Reedy_(WMF) [19:06:10] ah [19:06:15] Reedy: there's some Scribunto to autoupdate them [19:06:23] Which I don't understand [19:06:23] Not sure if it'll work in this case? [19:06:29] It's auto updating from temp? [19:06:48] we don't use temp anymore I think, not for stats updates and the like [19:06:58] I'd ask jan_drewniak tbh [19:07:07] he's taken care of regular updates [19:07:15] and surely knows where to touch [19:11:42] * Reedy just does them all manually [19:14:10] Okay :) [19:15:28] Looks like https://gerrit.wikimedia.org/g/wikimedia/portals/deploy [19:15:31] That a bot builds stuff? [19:16:07] Why don't they make it to github? 
[19:16:07] https://github.com/wikimedia/portals-deploy [19:16:33] Hauskatze: Two more commits to portals over in -dev [19:16:46] Looks like PortalsBuilder only runs weekly? [19:18:52] Reedy: that's correct, only weekly [19:19:42] best bit is we can make jenkins run now [19:21:38] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: Herron TTN-0002561980 Wikimedia Foundation Inc / /OGYX/120003/ /ZYO / / Hard Down 101 /FIBER /AUSWTXOOHG1/HSTSTXGSHG1 appears to be cut between Austin and San Marcos Repair in Process [19:25:42] https://meta.wikimedia.org/wiki/Project_portals needs updating [19:25:58] "Changes made here are immediately seen by millions of people." [19:26:04] No I don't think they are... [19:27:19] billions? :) [19:27:34] (03PS1) 10BBlack: foundationwiki rename: fixup trivial refs across puppet [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) [19:29:15] (03PS1) 10Reedy: Update portals for foundationwiki move [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446396 (https://phabricator.wikimedia.org/T199808) [19:30:20] bblack: I'm unclear if it's deployed until it's merged into the repo [19:30:32] ala https://gerrit.wikimedia.org/r/#/c/wikimedia/portals/deploy/+/446394/1/wikibooks.org/index.html [19:30:57] (03CR) 10Reedy: [C: 032] Update portals for foundationwiki move [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446396 (https://phabricator.wikimedia.org/T199808) (owner: 10Reedy) [19:31:44] Reedy: no idea [19:31:57] At one point, it did go live automatically [19:32:09] in any case, it should be safe for it to go live automatically now [19:32:30] (03Merged) 10jenkins-bot: Update portals for foundationwiki move [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446396 (https://phabricator.wikimedia.org/T199808) (owner: 10Reedy) [19:34:50] !log reedy@deploy1001 Synchronized 
portals/wikipedia.org/assets: T199808 (duration: 00m 53s) [19:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:54] T199808: Update portals to cater for foundationwiki move - https://phabricator.wikimedia.org/T199808 [19:35:19] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Reedy) [19:35:45] !log reedy@deploy1001 Synchronized portals: T199808 (duration: 00m 54s) [19:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:11] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10BBlack) **Human status update, amidst the flurry of gerritbot notes:** At this point, the foundationwiki i... [19:41:30] (03PS4) 10BBlack: wikimediafoundation.org: add CAA authorizing LE [dns] - 10https://gerrit.wikimedia.org/r/446306 (https://phabricator.wikimedia.org/T198922) [19:42:04] (03CR) 10BBlack: [C: 032] wikimediafoundation.org: add CAA authorizing LE [dns] - 10https://gerrit.wikimedia.org/r/446306 (https://phabricator.wikimedia.org/T198922) (owner: 10BBlack) [19:51:16] jouncebot: now [19:51:17] For the next 1 hour(s) and 8 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T1900) [19:51:19] jouncebot: next [19:51:19] In 3 hour(s) and 8 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T2300) [19:51:22] Cool. 
Can scap [19:52:27] !log reedy@deploy1001 Started scap: updating l10n cache for foundationwiki url changes [19:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:38] PROBLEM - Disk space on elastic1032 is CRITICAL: DISK CRITICAL - free space: /srv 73308 MB (10% inode=99%) [19:55:10] (03CR) 10ArielGlenn: [C: 031] "The dumps-related changes all look good." [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [19:55:33] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) They kicked back a denied and now I'm on with support to file a new task, no self dispatch. Request 975216005 filed! [19:55:37] hmmmm, test2.wikipedia.org is no longer part of group 0? i see it is running wmf.12 right now [19:55:44] test.wikipedia.org has wmf.13 [19:56:17] apparently not [19:57:10] Then someone needs to update https://www.mediawiki.org/wiki/MediaWiki_1.32/Roadmap [19:59:34] https://github.com/wikimedia/operations-mediawiki-config/commits/master/dblists/group0.dblist [19:59:37] https://github.com/wikimedia/operations-mediawiki-config/commit/d6f8ac56b524d0060a75a32b84e1f7b9ba2e0f04#diff-5e9969217eb03889ab48e4b504cc06f7 [19:59:43] Looks like it changed in December 2017 [20:00:32] (03PS1) 10Andrew Bogott: Revert "cloud vps: disable labtestnet2001 and replace it with labtestnet2003" [puppet] - 10https://gerrit.wikimedia.org/r/446404 [20:01:04] i think it was reasonable to move it, it is actually nice to have test wikis running different MW versions to compare. but i will note that this means we have no group0 wiki that is running FlaggedRevs. maybe we should install it on testwiki. 
[20:01:15] (i don't care enough to do anything about this myself though :) ) [20:04:08] (03CR) 10jenkins-bot: Update portals for foundationwiki move [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446396 (https://phabricator.wikimedia.org/T199808) (owner: 10Reedy) [20:04:42] (03PS2) 10Andrew Bogott: Revert "cloud vps: disable labtestnet2001 and replace it with labtestnet2003" [puppet] - 10https://gerrit.wikimedia.org/r/446404 [20:05:41] (03CR) 10Reedy: [C: 031] foundationwiki rename: fixup trivial refs across puppet [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [20:08:01] (03PS1) 10Reedy: Revert "Make foundationwiki readonly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446406 [20:08:04] (03CR) 10Andrew Bogott: [C: 032] Revert "cloud vps: disable labtestnet2001 and replace it with labtestnet2003" [puppet] - 10https://gerrit.wikimedia.org/r/446404 (owner: 10Andrew Bogott) [20:14:04] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) a:03Imarlier [20:17:12] !log un-ban elastic1030 from eqiad search cluster [20:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:47] RECOVERY - Disk space on elastic1032 is OK: DISK OK [20:21:08] (03PS2) 10BBlack: foundationwiki rename: fixup trivial refs across puppet [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) [20:21:52] (03CR) 10BBlack: [C: 032] foundationwiki rename: fixup trivial refs across puppet [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [20:26:37] PROBLEM - Disk space on elastic1051 is CRITICAL: DISK CRITICAL - free space: /srv 73227 MB (10% inode=99%) [20:29:06] (03PS1) 10Bstorm: dumps: fail over dumps web to 
labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/446476 (https://phabricator.wikimedia.org/T196651) [20:30:46] PROBLEM - Disk space on elastic1051 is CRITICAL: DISK CRITICAL - free space: /srv 73231 MB (10% inode=99%) [20:44:15] RECOVERY - Disk space on elastic1051 is OK: DISK OK [20:49:08] (03PS2) 10Bstorm: WIP dumps: fail over dumps web to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/446476 (https://phabricator.wikimedia.org/T196651) [20:50:59] !log reedy@deploy1001 Finished scap: updating l10n cache for foundationwiki url changes (duration: 58m 32s) [20:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:45] (03PS2) 10Reedy: Revert "Make foundationwiki readonly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446406 [20:53:49] (03CR) 10Reedy: [C: 032] Revert "Make foundationwiki readonly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446406 (owner: 10Reedy) [20:53:59] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Reedy) >>! In T188776#4431684, @BBlack wrote: > There's some cleanup commits still going on all over the pl... 
[20:55:45] (03Merged) 10jenkins-bot: Revert "Make foundationwiki readonly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446406 (owner: 10Reedy) [20:57:03] !log reedy@deploy1001 Synchronized wmf-config/: make foundation.wikimedia.org writeable T188776 (duration: 00m 53s) [20:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:06] T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 [20:58:39] (03CR) 10jenkins-bot: Revert "Make foundationwiki readonly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446406 (owner: 10Reedy) [21:21:23] 10Operations, 10Research, 10SRE-Access-Requests: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10RobH) @Capt_Swing This doesn't appear to be an #ldap-access-requests, since you are specifically linking to SSH and production access to systems. (Also those directions s... [21:22:03] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10RobH) [21:22:56] (03PS1) 10Andrew Bogott: bootstrapvz and vmbuilder: Deal with different 'hostname' metadata vars [puppet] - 10https://gerrit.wikimedia.org/r/446493 [21:23:53] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz and vmbuilder: Deal with different 'hostname' metadata vars [puppet] - 10https://gerrit.wikimedia.org/r/446493 (owner: 10Andrew Bogott) [21:37:11] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10Capt_Swing) Thanks @RobH. I added the wrong project by mistake. [21:40:42] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10RobH) I thought it was perhaps access request, but then saw its likely and ldap fix and you had it right. 
I started to look into fixing it for you. Since you are staf... [21:40:55] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10RobH) 05Open>03Resolved a:03RobH (If it isn't fixed, reopen this task!) [21:41:08] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10RobH) a:05RobH>03None [21:55:08] (03PS1) 10Bstorm: dumps distribution: failing web services over to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/446497 (https://phabricator.wikimedia.org/T196651) [22:05:09] (03PS1) 10Reedy: Fix foundationwiki url in tests/urls.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446498 [22:05:50] (03CR) 10Reedy: [C: 032] Fix foundationwiki url in tests/urls.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446498 (owner: 10Reedy) [22:07:39] (03Merged) 10jenkins-bot: Fix foundationwiki url in tests/urls.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446498 (owner: 10Reedy) [22:17:46] (03PS1) 10Reedy: Remove specific wikimedia foundation mobile url template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446500 [22:17:48] Krinkle: ^ [22:18:14] (03CR) 10Reedy: [C: 032] Remove specific wikimedia foundation mobile url template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446500 (owner: 10Reedy) [22:19:31] (03Merged) 10jenkins-bot: Remove specific wikimedia foundation mobile url template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446500 (owner: 10Reedy) [22:20:42] (03CR) 10Bstorm: [C: 031] "I'm all for it, but I'm curious if there are still any concerns." [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T153577) (owner: 10Alex Monk) [22:21:03] !log reedy@deploy1001 Synchronized tests/urls.txt: testding! 
(duration: 00m 53s) [22:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:15] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove override for foundationwiki in wgMobileUrlTemplate (duration: 00m 53s) [22:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:03] (03CR) 10Krinkle: foundationwiki rename: fixup trivial refs across puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [22:31:56] (03CR) 10BBlack: foundationwiki rename: fixup trivial refs across puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446395 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [22:33:06] 10Operations, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Krenair) Oh, other thing to maybe consider that I had in the notes: Currently acme_tiny will just check that it can complete the challenge itself once, with any random backend host (web serve... 
[22:34:26] (03PS1) 10BBlack: fixup blackbox_exporter check for foundationwiki ToS URLs [puppet] - 10https://gerrit.wikimedia.org/r/446501 [22:34:42] (03CR) 10jenkins-bot: Fix foundationwiki url in tests/urls.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446498 (owner: 10Reedy) [22:34:44] (03CR) 10jenkins-bot: Remove specific wikimedia foundation mobile url template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446500 (owner: 10Reedy) [22:34:47] (03PS2) 10BBlack: fixup blackbox_exporter check for foundationwiki [puppet] - 10https://gerrit.wikimedia.org/r/446501 [22:35:08] (03CR) 10BBlack: [C: 032] fixup blackbox_exporter check for foundationwiki [puppet] - 10https://gerrit.wikimedia.org/r/446501 (owner: 10BBlack) [22:35:25] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 [22:35:27] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:35:42] Well, that's no good [22:35:46] Hauskatze: ^ [22:36:48] (03PS2) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 [22:37:26] (03CR) 10Reedy: "Why did scap update-interwiki-cache on beta update prods interwiki.php?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:38:20] andrewbogott: I merged your bootstrap.sh changes, they'd been lingering an hour and didn't seem breaky! :) [22:38:33] oh, thanks [22:40:27] (03PS3) 10Reedy: Updating beta interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 [22:40:31] (03CR) 10Reedy: [C: 032] Updating beta interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:42:08] (03Merged) 10jenkins-bot: Updating beta interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:42:15] Reedy: you pinged? 
[22:42:20] (03CR) 10jenkins-bot: Updating beta interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:42:28] I thought you'd got scap update-interwiki-cache working on beta [22:42:40] no, I got scap update-interwiki-cache working [22:42:46] it was broken [22:42:50] on production [22:42:54] thcipriani fixed it [22:43:11] but the script does not update the beta one [22:43:14] hence I filed a task [22:43:36] !log reedy@deploy1001 Synchronized wmf-config/interwiki-labs.php: (no justification provided) (duration: 00m 53s) [22:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:54] meh [22:43:59] IW cache updated anyway for you :P [22:44:05] which I currently can't find [22:44:33] T198844 [22:44:34] T198844: Add a --labs option to 'scap update-interwiki-cache' to be able to update the interwiki-labs.php file using Scap - https://phabricator.wikimedia.org/T198844 [22:44:34] https://phabricator.wikimedia.org/T197166 [22:45:55] In any case I can't use the script either because it requires me to have +2 on operations/mediawiki-config.git and 'forge author identity' permissions as well so it does not end having xyz@deployment-tin.eqiad.wmnet as commit address [22:48:45] (03CR) 10MarcoAurelio: Updating beta interwiki cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:49:29] (03CR) 10MarcoAurelio: Updating beta interwiki cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446502 (owner: 10Reedy) [22:50:20] I wonder if we should have a separate repo to contain the labs overrides [22:50:40] or does that make it more likely someone will update one and forget the other [22:54:19] 10Operations, 10LDAP-Access-Requests, 10Research: Unable to access SWAP notebooks using LDAP - https://phabricator.wikimedia.org/T199757 (10Capt_Swing) Yep, that worked. Thank you, @RobH ! 
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180717T2300). [23:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:27] hi [23:03:30] * Reedy eyes MatmaRex [23:03:53] noooo [23:03:54] scap [23:04:26] ;_; [23:05:06] Reedy: hmm actually, do you mind if i add another one? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/446480 [23:05:10] if we're scapping anyway [23:05:22] sure [23:05:45] cherry-pick: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/446506 [23:05:49] * Reedy waits for jerkins [23:07:36] (03PS1) 10Thcipriani: Scap: update-interwiki-cache for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) [23:08:34] Reedy: for reference, the three message fixes affect the last row of the form here: https://www.mediawiki.org/wiki/Special:Log and the remaining one makes this not throw exceptions: https://www.mediawiki.org/wiki/Special:Log/gblrename [23:18:18] 10Operations, 10ops-codfw, 10Analytics, 10EventBus, and 3 others: EventStreams accumulates too much memory on SCB nodes in CODFW - https://phabricator.wikimedia.org/T199813 (10Liuxinyu970226) [23:23:31] 2 more to merge [23:25:07] 1 [23:25:42] MatmaRex: That's them all merged, right? 
[23:26:06] yeah, all four [23:26:20] scap time then [23:26:52] !log reedy@deploy1001 Started scap: log event fixes [23:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:46] (03PS1) 10BBlack: make foundation.wikimedia.org redirect a 302 [puppet] - 10https://gerrit.wikimedia.org/r/446509 (https://phabricator.wikimedia.org/T188776) [23:45:12] (03CR) 10BBlack: [C: 032] make foundation.wikimedia.org redirect a 302 [puppet] - 10https://gerrit.wikimedia.org/r/446509 (https://phabricator.wikimedia.org/T188776) (owner: 10BBlack) [23:53:43] sync-apaches: 61% (ok: 163; fail: 0; left: 101) [23:53:45] ffs [23:55:01] :( [23:55:41] scap-cdb-rebuild: 34% (ok: 99; fail: 0; left: 187) [23:58:24] scap-cdb-rebuild: 98% (ok: 283; fail: 0; left: 3) [23:59:05] c'mon scap [23:59:09] do it before the end of the window