[00:00:08] (PS2) DannyS712: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083)
[00:00:14] (PS3) DannyS712: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083)
[00:00:40] (PS4) DannyS712: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083)
[01:07:17] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (Bstorm) It seems that this is showing a loss of 4 disks. We may want to check a controller in this case. I see that after our big outage where this was one of the two hypervisors that went down from disk...
[01:09:57] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (Bstorm) Since the filesystem has gone read-only, I was only able to get part of the firmware terminal logs.
[01:12:42] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (Bstorm) Some controller info: ` Adapter #0 ============================================================================== Versions ================ Product Name : PE...
[01:13:16] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (Bstorm) Nothing in the eventlog when I tried to retrieve it.
[01:31:26] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230289 (Bstorm) Per T230442, this appears to be something strange going on, possibly a controller freaking out. It lost 4 disks in a very short time and is now a read-only volume. Feel fr...
[01:32:02] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (Bstorm)
[01:32:46] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (Bstorm)
[01:33:27] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (Bstorm) Just to re-emphasize: this system does not have any loads on it at this time, so it's a wonderful time for it to blow up. It can be repaired and rebooted as needed.
[01:38:07] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:01] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:07:03] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:07:15] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[02:10:27] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[02:30:29] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:32:07] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:52:39] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1376555112 and 106 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:05] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 730140312 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:27] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1408886648 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:57:19] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[02:57:31] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 430110520 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:00:31] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:00:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 533059552 and 43 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:02:05] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 600738240 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:03:41] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5096 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:03:55] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 35480 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:04:03] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 60232 and 53 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:11:45] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page
[03:11:45] pected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:14:59] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:19:51] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITI
[03:19:51] view mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:23:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:40:51] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:42:27] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:00:13] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:05:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:09:55] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a rando
[04:09:55] eturned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:14:47] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:54:53] (CR) Marostegui: [C: +2] wmnet: Point m3-master codfw to dbproxy2003 [dns] - https://gerrit.wikimedia.org/r/529847 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[04:54:57] (PS2) Marostegui: wmnet: Point m3-master codfw to dbproxy2003 [dns] - https://gerrit.wikimedia.org/r/529847 (https://phabricator.wikimedia.org/T202367)
[05:04:13] Operations, ops-eqiad, DC-Ops: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (Marostegui)
[05:04:33] Operations, ops-eqiad, DBA, DC-Ops: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (Marostegui)
[05:09:01] (PS1) Marostegui: db2122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530021 (https://phabricator.wikimedia.org/T228969)
[05:09:36] (CR) Marostegui: [C: +2] db2122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530021 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:39:40] (PS1) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023
[05:41:06] (PS2) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023
[06:00:47] (PS3) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023
[06:07:31] (PS1) Marostegui: dbproxy2003: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530025 (https://phabricator.wikimedia.org/T202367)
[06:08:27] (CR) Marostegui: [C: +2] dbproxy2003: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/530025 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[06:20:07] (Abandoned) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023 (owner: Vgutierrez)
[06:25:49] (Restored) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023 (owner: Vgutierrez)
[07:03:26] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,1 instance=db2044:9100 job=node site=codfw Marostegui T230459 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[07:04:38] (PS1) Marostegui: db-codfw,db-eqiad.php: Remove db2063 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530034 (https://phabricator.wikimedia.org/T230459)
[07:05:53] (CR) Marostegui: [C: +2] db-codfw,db-eqiad.php: Remove db2063 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530034 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:06:48] (Merged) jenkins-bot: db-codfw,db-eqiad.php: Remove db2063 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530034 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:07:22] (CR) jenkins-bot: db-codfw,db-eqiad.php: Remove db2063 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/530034 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:08:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2063 from config T230459 (duration: 00m 48s)
[07:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:12] T230459: Replace db2044 with db2063 - https://phabricator.wikimedia.org/T230459
[07:09:05] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2063 from config T230459 (duration: 00m 47s)
[07:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:09] (PS1) Marostegui: mariadb: Move db2063 from s2 to m2 [puppet] - https://gerrit.wikimedia.org/r/530035 (https://phabricator.wikimedia.org/T230459)
[07:11:45] (CR) jerkins-bot: [V: -1] mariadb: Move db2063 from s2 to m2 [puppet] - https://gerrit.wikimedia.org/r/530035 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:14:11] (CR) Ema: [C: +2] Add discovery CNAME phabricator -> phab1003 [dns] - https://gerrit.wikimedia.org/r/529306 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[07:14:14] (PS2) Ema: Add discovery CNAME phabricator -> phab1003 [dns] - https://gerrit.wikimedia.org/r/529306 (https://phabricator.wikimedia.org/T210411)
[07:15:41] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17882/ the -1 is a known issue that needs to be addressed with a big r" [puppet] - https://gerrit.wikimedia.org/r/530035 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:15:52] (CR) Marostegui: [V: +2 C: +2] mariadb: Move db2063 from s2 to m2 [puppet] - https://gerrit.wikimedia.org/r/530035 (https://phabricator.wikimedia.org/T230459) (owner: Marostegui)
[07:18:21] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/529309 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[07:21:56] (PS4) Vgutierrez: Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023
[07:26:22] RECOVERY - Disk space on ms-be2021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops
[07:27:26] that's me ^
[07:32:35] Operations, ops-codfw, DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (Marostegui)
[07:32:40] Operations, ops-codfw, DBA, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui)
[07:48:03] (CR) Ema: [C: +1] Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023 (owner: Vgutierrez)
[07:51:57] (PS2) Ema: Add TLS termination for phabricator.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/529309 (https://phabricator.wikimedia.org/T210411)
[07:55:17] (CR) Ema: [C: +2] Add TLS termination for phabricator.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/529309 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[08:00:44] (PS1) Filippo Giunchedi: swift: stop monitoring individual daemons [puppet] - https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878)
[08:02:33] (CR) Filippo Giunchedi: [C: +1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/17884/" [puppet] - https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:04:39] (CR) Vgutierrez: [C: +2] Release 1.13.9-1+wmf2 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/530023 (owner: Vgutierrez)
[08:10:23] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot
[08:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:31] (Abandoned) Ema: restbase: add TLS support via profile::tlsproxy::service [puppet] - https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[08:12:37] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[08:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:41] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot
[08:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:31] !log uploaded nginx-1.13.9-1+wmf2 to apt.wikimedia.org (stretch)
[08:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:29] (PS1) Giuseppe Lavagetto: envoyproxy: only run build-envoy-config from puppet [puppet] - https://gerrit.wikimedia.org/r/530083
[08:20:49] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (abi_) >>! In T230020#5403340, @Dzahn wrote: > Hi @abi_ may i ask what the part you are planning to run on mwmaint servers will be? It's less com...
[08:24:42] (CR) Ema: [C: +1] "LGTM, one doubt" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530083 (owner: Giuseppe Lavagetto)
[08:25:33] !log upgrading nginx to 1.13.9-1+wmf2 in cp5001 (upload) and cp5007 (text)
[08:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:00] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/17885/vega.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/530083 (owner: Giuseppe Lavagetto)
[08:33:07] Noticing search index not being updated after edits on nl.wikipedia. At least a 5 min lag. Maybe an issue?
[08:33:37] Krinkle: you mean cirrus search?
[08:34:02] there is a cluster reboot in progress, which does increase the update lag
[08:34:11] also 5 minutes is well within the expectations
[08:37:31] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-14 05:05:00 from db2098.codfw.wmnet:3313 (766 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[08:48:30] (PS1) Giuseppe Lavagetto: envoyproxy: rebuild configuration when the admin file is changed [puppet] - https://gerrit.wikimedia.org/r/530087
[08:51:18] (CR) Giuseppe Lavagetto: [C: +2] envoyproxy: rebuild configuration when the admin file is changed [puppet] - https://gerrit.wikimedia.org/r/530087 (owner: Giuseppe Lavagetto)
[08:51:24] (PS1) Ammarpad: Add new throttle rule for cawiki editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/530088 (https://phabricator.wikimedia.org/T230313)
[08:52:01] !log upgrading nginx to 1.13.9-1+wmf2 in cp1075, cp2001, cp3030 and cp4027 (text) and cp1076, cp2002, cp3034, cp4021 (upload)
[08:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:03] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1098 threshold =0.2 breach: active_shards_percent_as_number: 75.07943713118475, timed_out: False, unassigned_shards: 1098, delayed_unassigned_shards: 0, relocating_shards: 0, number_of_pending_tasks: 0, cluster_name: production-search-omega-eqiad, number_of_in_flight_fetch: 0, number_of_nodes: 15
[08:53:03] g_in_queue_millis: 0, active_shards: 3308, status: yellow, active_primary_shards: 1469, number_of_data_nodes: 15, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:56:17] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: unassigned_shards: 589, number_of_in_flight_fetch: 0, number_of_data_nodes: 18, active_shards: 3811, number_of_pending_tasks: 6, status: yellow, active_shards_percent_as_number: 86.49568769859283, delayed_unassigned_shards: 0, number_of_nodes: 18, cluster_name: production-search-omega-
[08:56:17] ng_shards: 6, relocating_shards: 0, active_primary_shards: 1469, task_max_waiting_in_queue_millis: 67554, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:56:29] ^ overly sensitive check, will tune
[09:07:24] (PS2) Ema: ATS: use TLS and discovery hostname for phabricator [puppet] - https://gerrit.wikimedia.org/r/529318 (https://phabricator.wikimedia.org/T210411)
[09:09:05] (CR) Ema: [C: +2] ATS: use TLS and discovery hostname for phabricator [puppet] - https://gerrit.wikimedia.org/r/529318 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[09:16:04] (PS2) Ema: ATS: enable compress plugin on cp5002 [puppet] - https://gerrit.wikimedia.org/r/529945 (https://phabricator.wikimedia.org/T227432)
[09:17:11] (CR) Ema: [C: +2] ATS: enable compress plugin on cp5002 [puppet] - https://gerrit.wikimedia.org/r/529945 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[09:20:00] !log cp5002: ats-backend-restart to enable compress plugin
[09:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:02] poor cp5002
[09:22:15] (PS3) Marostegui: mariadb: Promote db1133 to m5 master [puppet] - https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657)
[09:22:21] (PS2) Marostegui: wmnet: Promote db1133 to m5 master [dns] - https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657)
[09:22:52] Operations, observability, Patch-For-Review, User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (fgiunchedi) We're now collecting metrics from all managed PDUs into prometheus, including environmental sensors. The names re...
[09:26:27] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (fgiunchedi) When the time comes to upgrade PDUs puppet should be updated too to reflect the new reality, specifically the `facilities` module to either add `model => 'sen...
[09:28:22] (PS1) Vgutierrez: Merge tag 'upstream/8.0.4' [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091
[09:28:27] (CR) Gehel: [C: -1] "I think there is place to have both __iter__() and split(). While iteration could be done with hosts.split(hosts.lentgh), having RemoteHos" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/529802 (owner: Mathew.onipe)
[09:28:34] (CR) jerkins-bot: [V: -1] Merge tag 'upstream/8.0.4' [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091 (owner: Vgutierrez)
[09:28:41] of course
[09:31:38] (PS2) Vgutierrez: Release 8.0.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091
[09:31:50] (CR) jerkins-bot: [V: -1] Release 8.0.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091 (owner: Vgutierrez)
[09:40:10] (CR) Giuseppe Lavagetto: [C: -1] "This is great, and should be expanded further :) See my comment inline to fix the current query." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:40:36] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97)
[09:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:52] (PS1) Jbond: cross-validate-accounts: add gpu-users to list of ops groups [puppet] - https://gerrit.wikimedia.org/r/530093
[09:47:18] (CR) jerkins-bot: [V: -1] cross-validate-accounts: add gpu-users to list of ops groups [puppet] - https://gerrit.wikimedia.org/r/530093 (owner: Jbond)
[09:48:07] (PS2) Jbond: cross-validate-accounts: add gpu-users to list of ops groups [puppet] - https://gerrit.wikimedia.org/r/530093
[09:48:44] (PS1) Andrew Bogott: Move cloudvirt1021, 1022 and 1023 to Stretch [puppet] - https://gerrit.wikimedia.org/r/530094 (https://phabricator.wikimedia.org/T229873)
[09:49:12] (CR) Vgutierrez: "recheck" [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091 (owner: Vgutierrez)
[09:49:27] (CR) Jbond: [C: +2] cross-validate-accounts: add gpu-users to list of ops groups [puppet] - https://gerrit.wikimedia.org/r/530093 (owner: Jbond)
[09:53:20] gehel: it's now been 2 hours, still seemingly unchanged/outdated nlwiki result for several pages I edited
[09:53:28] https://nl.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%2FTop_icon_raw%2F&title=Speciaal:Zoeken&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns10=1
[09:53:35] (CR) Volans: "replied inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/529802 (owner: Mathew.onipe)
[09:53:49] should have <= 6 results, instead of 9. As most pages were edited today to no longer match
[10:04:50] (CR) Filippo Giunchedi: "Thanks for the review!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[10:10:04] (PS1) Elukey: Add sre.hadoop.reboot-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297)
[10:14:29] (PS4) Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[10:16:13] (CR) Ema: [C: +1] Release 8.0.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091 (owner: Vgutierrez)
[10:17:19] (CR) Vgutierrez: [C: +2] Release 8.0.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/530091 (owner: Vgutierrez)
[10:24:09] Operations, observability, Patch-For-Review, User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (fgiunchedi) From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics: 1. Checking over...
[10:28:37] !log Starting smoketest of termbox service on eqiad: T229907
[10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:45] T229907: Synthetic Load Test - https://phabricator.wikimedia.org/T229907
[10:38:40] Krinkle: @lunch, I'll have a look as soon as I'm back, but this might need Erik expertise
[10:43:18] (PS1) Filippo Giunchedi: icinga: add acknowledge details to emails [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413)
[10:45:06] (PS1) Ema: ATS: leave AE removal to Lua [puppet] - https://gerrit.wikimedia.org/r/530099 (https://phabricator.wikimedia.org/T227432)
[10:45:17] Operations, DNS, Domains, Traffic: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (4shadoww)
[10:46:38] !log depool cp5002 after crash. See /var/log/trafficserver/crash-2019-08-14-104502.log
[10:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:23] (CR) Ema: [C: +2] ATS: leave AE removal to Lua [puppet] - https://gerrit.wikimedia.org/r/530099 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[10:50:54] (CR) Arturo Borrero Gonzalez: [C: +1] Move cloudvirt1021, 1022 and 1023 to Stretch [puppet] - https://gerrit.wikimedia.org/r/530094 (https://phabricator.wikimedia.org/T229873) (owner: Andrew Bogott)
[10:54:01] (CR) Elukey: Add Cache-Control response header for Wikistats V2's index.html (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529795 (https://phabricator.wikimedia.org/T230136) (owner: Elukey)
[11:00:04] Amir1, Lucas_WMDE, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190814T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:51] (CR) Giuseppe Lavagetto: [C: -1] "I think this patch should be rebased on top of https://gerrit.wikimedia.org/r/#/c/operations/software/spicerack/+/529976/ and use it as a " (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/529802 (owner: Mathew.onipe)
[11:05:04] (CR) Volans: [C: +1] "LGTM, we might decide to add another method to use the batch_size instead of the number of splits. But that's unrelated to this one." [software/spicerack] - https://gerrit.wikimedia.org/r/529976 (owner: Giuseppe Lavagetto)
[11:05:21] Operations, DNS, Domains, Traffic: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Vgutierrez) a quick check shows: ` willikins:~ vgutierrez$ host -t ns wikipedia.fi Host wikipedia.fi not found: 2(SERVFAIL) ` And from the whois output: ` Nameserver...
[11:06:42] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot
[11:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:16] (PS1) Vgutierrez: Point wikipedia.fi domain to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/530103 (https://phabricator.wikimedia.org/T230470)
[11:12:04] Krinkle: I've created T230472 to track the issue. I don't see anything obviously wrong (except that the search results don't match the current content).
[11:12:04] T230472: Search index not updated for nl.w.o - https://phabricator.wikimedia.org/T230472
[11:12:05] Operations, DNS, Domains, Traffic, Patch-For-Review: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Vgutierrez) p:Triage→Normal
[11:12:15] gehel: thanjs
[11:12:19] Erik might find something that makes more sense.
[11:12:51] gehel: I've done some null edits in case it was transient, but still seems outdated.
[11:13:03] I could just reindex nl.w.o, but I'd prefer Erik to have a look first
[11:13:32] gehel: yeah, also no reason at this time to believe it is limited to nlwiki (why would it :) )
[11:13:52] yep, and probably not limited to your edits either :)
[11:14:18] I'm using "insource:" which be related. I don't know how separate those indexes are
[11:14:57] we have a few things that are updated through crons, but I doubt that insource relies on any of those
[11:17:32] (PS1) Vgutierrez: nc_redirects.dat: Redirect wikipedia.fi to https://fi.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/530104 (https://phabricator.wikimedia.org/T230470)
[11:17:58] gehel: I'm looking at some plain searches (not insource) for edits made 5+ minutes ago, e.g. https://nl.wikipedia.org/w/index.php?title=Jacques_Vermeire&curid=442588&diff=54376623&oldid=54376556
[11:18:02] https://nl.wikipedia.org/w/index.php?search=%22speelde+de+rol+van+DDT%22&title=Speciaal%3AZoeken&go=Artikel&ns0=1
[11:18:03] no results either
[11:18:42] gotta get some free lunch, bbl
[11:19:19] we're still processing updates (https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=1565770737580&to=1565781537580&refresh=1m&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi), so not sure what's happening here
[11:19:31] !log termbox smoketests finished
[11:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:48] !log repooling cp5002
[11:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:34] (PS2) Vgutierrez: nc_redirects.dat: Redirect wikipedia.fi to https://fi.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/530104 (https://phabricator.wikimedia.org/T230470)
[11:25:28] (CR) Vgutierrez: [C: +2] nc_redirects.dat: Redirect wikipedia.fi to https://fi.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/530104 (https://phabricator.wikimedia.org/T230470) (owner: Vgutierrez)
[11:29:50] (CR) Vgutierrez: [C: +2] Point wikipedia.fi domain to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/530103 (https://phabricator.wikimedia.org/T230470) (owner: Vgutierrez)
[11:30:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.424e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:34:03] Operations, DNS, Domains, Traffic, Patch-For-Review: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Vgutierrez) So, after adding the zone file for wikipedia.fi and the proper redirect rules: ` $ curl http://wikipedia.fi -o /dev/null -v 2>&1|gre...
[11:43:15] (CR) Mforns: "Awesome improvement!!" [cookbooks] - https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: Elukey)
[11:43:45] Operations, DNS, Domains, Traffic, Patch-For-Review: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (4shadoww) Ok. Thanks for lighting fast fix for this @Vgutierrez!
[11:46:38] (PS1) Jbond: idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108
[11:50:05] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.22e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:50:41] (CR) Jbond: [C: +2] idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108 (owner: Jbond)
[11:50:51] (PS2) Jbond: idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108
[11:51:45] (CR) Jbond: [C: +2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/530108 (owner: Jbond)
[11:52:55] (CR) Mobrovac: "LGTM. Question: should the comment be sanitised / escaped so as to avoid cmd arg breakage?" [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: Filippo Giunchedi)
[11:58:32] (PS1) Jbond: apereo_cas: add empty keystore file [labs/private] - https://gerrit.wikimedia.org/r/530111
[11:59:07] (CR) Jbond: [V: +2 C: +2] apereo_cas: add empty keystore file [labs/private] - https://gerrit.wikimedia.org/r/530111 (owner: Jbond)
[11:59:46] (PS3) Jbond: idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108
[12:00:17] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/530108 (owner: Jbond)
[12:02:07] (PS1) Jbond: apereo_cas: add some content to the keystore [labs/private] - https://gerrit.wikimedia.org/r/530112
[12:02:29] (CR) Jbond: [V: +2 C: +2] apereo_cas: add some content to the keystore [labs/private] - https://gerrit.wikimedia.org/r/530112 (owner: Jbond)
[12:03:10] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/530108 (owner: Jbond)
[12:05:31] (CR) Jbond: [C: +2] idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108 (owner: Jbond)
[12:05:38] (PS4) Jbond: idp: add keystore password configueration to apereo_cas [puppet] - https://gerrit.wikimedia.org/r/530108
[12:09:41] jbond unleashed /o\
[12:10:39] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:16] (PS1) Jbond: apereo_cas: correct some settings [puppet] - https://gerrit.wikimedia.org/r/530115
[12:16:42] (PS2) Jbond: apereo_cas: correct some settings [puppet] - https://gerrit.wikimedia.org/r/530115
[12:16:54] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/530115 (owner: Jbond)
[12:17:40] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot
(exit_code=97) [12:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:31] (03PS3) 10Jbond: apereo_cas: correct some settings [puppet] - 10https://gerrit.wikimedia.org/r/530115 [12:20:04] !log rolling upgrade of nginx to 1.13.9-1+wmf2 in the cache cluster [12:20:06] lol [12:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 4136 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:20:39] (03CR) 10Jbond: [C: 03+2] apereo_cas: correct some settings [puppet] - 10https://gerrit.wikimedia.org/r/530115 (owner: 10Jbond) [12:24:49] !log We're going to try making a new wiki. T212881 [12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:58] T212881: addWiki.php broken creating ES tables - https://phabricator.wikimedia.org/T212881 [12:24:59] Please stand back, we're trying science. [12:27:31] (03PS16) 10Reedy: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [12:28:04] (03PS1) 10Jbond: ldap: add unboundID provider to fix ldap ssl issues [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530117 [12:28:15] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:51] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10faidon) That's an awesome idea, nice! 
We can't advertise just the /23 + /48 from eqord as these would be more-specifics to what eqiad itself advertises - and thus all of the eqiad traffic would flow... [12:29:13] ACKNOWLEDGEMENT - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond Host still in development https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:33] (03PS5) 10Elukey: Add Cache-Control response header for Wikistats V2's index.html [puppet] - 10https://gerrit.wikimedia.org/r/529795 (https://phabricator.wikimedia.org/T230136) [12:29:35] (03PS17) 10Reedy: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [12:29:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] ldap: add unboundID provider to fix ldap ssl issues [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530117 (owner: 10Jbond) [12:31:16] (03CR) 10Jforrester: [C: 03+1] Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [12:31:25] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:30] (03CR) 10Reedy: [C: 03+2] Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [12:36:33] (03CR) 10Elukey: [C: 03+2] Add Cache-Control response header for Wikistats V2's index.html [puppet] - 10https://gerrit.wikimedia.org/r/529795 (https://phabricator.wikimedia.org/T230136) (owner: 10Elukey) [12:36:39] (03Merged) 10jenkins-bot: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) 
[12:37:18] (03CR) 10jenkins-bot: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [12:39:13] (03PS1) 10Jbond: apereo_cas: ensure we use the latest git repo and add some notifies [puppet] - 10https://gerrit.wikimedia.org/r/530121 [12:41:33] (03CR) 10Jbond: [C: 03+2] apereo_cas: ensure we use the latest git repo and add some notifies [puppet] - 10https://gerrit.wikimedia.org/r/530121 (owner: 10Jbond) [12:43:08] (03PS1) 10Reedy: Add napwikisource to wikiversions and commonsupload.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530122 (https://phabricator.wikimedia.org/T210752) [12:43:56] (03CR) 10Reedy: [C: 03+2] Add napwikisource to wikiversions and commonsupload.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530122 (https://phabricator.wikimedia.org/T210752) (owner: 10Reedy) [12:44:58] (03Merged) 10jenkins-bot: Add napwikisource to wikiversions and commonsupload.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530122 (https://phabricator.wikimedia.org/T210752) (owner: 10Reedy) [12:46:25] (03CR) 10jenkins-bot: Add napwikisource to wikiversions and commonsupload.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530122 (https://phabricator.wikimedia.org/T210752) (owner: 10Reedy) [12:47:28] !log Wiki creation is still not working correctly, unfortunately. [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:57] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [12:48:59] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] (03PS1) 10Jbond: apereo_cas: dont use starttls [puppet] - 10https://gerrit.wikimedia.org/r/530124 [12:54:37] (03CR) 10Jbond: [C: 03+2] apereo_cas: dont use starttls [puppet] - 10https://gerrit.wikimedia.org/r/530124 (owner: 10Jbond) [12:56:57] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:28] 10Operations, 10ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (10faidon) 05Resolved→03Open Note this is now flagged in the Accounting report instead, as these are missing from Finance's spreadsheet - they have not been documented as assets, which is... [13:01:47] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:22] (03PS1) 10Ema: ATS: use proper origin for grafana.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/530127 (https://phabricator.wikimedia.org/T227432) [13:11:18] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: 10Filippo Giunchedi) [13:11:26] (03CR) 10Ema: [C: 03+2] ATS: use proper origin for grafana.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/530127 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:11:43] (03CR) 10Volans: "Some comment inline, thanks for creating cookbooks!" 
(038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [13:12:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.515e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:12:47] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:13:54] The mm lag is related to cirrusSearchElasticaWrite [13:13:57] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:14:01] I guess related to the rolling reboot [13:14:08] (of elasticsearch) [13:16:24] elukey: yep, not sure I can do something about it. The reboot is fairly conservative in waiting between servers to let everything recover. 
[13:16:39] nono all good I was only warning people :) [13:17:47] we have a ticket open somewhere to find a way to actually pause the processing of the updates on the kafka / job runner side, which would be a way better solution than what we are doing atm [13:18:01] but no update on that one for some time [13:18:32] (03PS1) 10Cmjohnson: Merge branch 'master' of https://gerrit.wikimedia.org/r/p/operations/dns into mydnschanges [dns] - 10https://gerrit.wikimedia.org/r/530129 [13:18:47] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [13:18:56] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'master' of https://gerrit.wikimedia.org/r/p/operations/dns into mydnschanges [dns] - 10https://gerrit.wikimedia.org/r/530129 (owner: 10Cmjohnson) [13:19:10] (03Abandoned) 10Cmjohnson: Merge branch 'master' of https://gerrit.wikimedia.org/r/p/operations/dns into mydnschanges [dns] - 10https://gerrit.wikimedia.org/r/530129 (owner: 10Cmjohnson) [13:22:36] (03PS1) 10Cmjohnson: Adding mgmt dns ganeti1009-1022 [dns] - 10https://gerrit.wikimedia.org/r/530132 (https://phabricator.wikimedia.org/T228924) [13:24:10] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns ganeti1009-1022 [dns] - 10https://gerrit.wikimedia.org/r/530132 (https://phabricator.wikimedia.org/T228924) (owner: 10Cmjohnson) [13:28:50] (03PS1) 10Ema: webserver-misc-apps: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/530134 (https://phabricator.wikimedia.org/T210411) [13:29:41] (03PS1) 10Jbond: idp: add crypto key support with spec tests [puppet] - 10https://gerrit.wikimedia.org/r/530135 [13:29:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.159e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:30:40] (03CR) 10jerkins-bot: [V: 04-1] idp: add crypto key support with spec tests [puppet] - 10https://gerrit.wikimedia.org/r/530135 (owner: 10Jbond) [13:31:37] (03PS1) 10Ema: secret: dummy key for webserver-misc-apps [labs/private] - 10https://gerrit.wikimedia.org/r/530136 (https://phabricator.wikimedia.org/T210411) [13:32:03] (03PS2) 10Jbond: idp: add crypto key support with spec tests [puppet] - 10https://gerrit.wikimedia.org/r/530135 [13:34:12] 10Operations, 10netbox: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (10Volans) 05Resolved→03Open Re-opening as there is still some work to do given that as it is right now is less redundant that it was before. Na... [13:35:01] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10abian) [13:36:52] (03CR) 10Ema: [C: 03+2] webserver-misc-apps: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/530134 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:41:08] 10Operations, 10ops-eqiad, 10vm-requests, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) @Jclark-ctr Mgmt IP's that need to be setup on the idrac Instructions for setup https://wikitech.wikimedia.org/wiki/Platform-specifi... 
[13:44:27] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) [13:47:04] 10Operations, 10ops-eqiad, 10vm-requests, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Jclark-ctr) [13:48:02] (03PS1) 10Jakob: Whitelist jenkins for edit rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481) [13:50:13] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Jclark-ctr) [13:50:59] (We're still testing in prod, sorry anyone that wants to do something.) [13:52:31] (03CR) 10Jakob: "Here is an example of the Selenium user logging in from the IP in question: https://logstash-beta.wmflabs.org/app/kibana#/doc/logstash-*/l" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481) (owner: 10Jakob) [13:53:16] !log reedy@deploy1001 Synchronized dblists/: T212881 (duration: 00m 48s) [13:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:25] T212881: addWiki.php broken creating ES tables - https://phabricator.wikimedia.org/T212881 [13:54:04] (03PS3) 10Jbond: idp: add crypto key support with spec tests [puppet] - 10https://gerrit.wikimedia.org/r/530135 [13:55:29] !log reedy@deploy1001 rebuilt and synchronized wikiversions files: T212881 [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:06] (03CR) 10Jbond: [C: 03+2] idp: add crypto key support with spec tests [puppet] - 10https://gerrit.wikimedia.org/r/530135 (owner: 10Jbond) [13:56:34] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T210752 (duration: 00m 47s) [13:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:42] T210752: Create Wikisource Neapolitan - 
https://phabricator.wikimedia.org/T210752 [13:57:20] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) @Jclark-ctr ganeti1019 10.65.5.114 ganeti1020 10.65.5.115 ganeti1021 10.65.5.116 ganeti1022 10.65.5.117 [13:58:08] !log reedy@deploy1001 Synchronized static/images/project-logos/: T210752 (duration: 00m 47s) [13:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:07] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530149 [13:59:09] (03CR) 10Reedy: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530149 (owner: 10Reedy) [13:59:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr John, Please check to make sure that the power cables are fully seated. Update the task and let me know if I need to order a new... 
[13:59:13] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530150 [13:59:15] (03CR) 10Reedy: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530150 (owner: 10Reedy) [13:59:50] (03Abandoned) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530150 (owner: 10Reedy) [14:00:05] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:00:07] (03PS1) 10Ema: Add TLS termination for webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/530151 (https://phabricator.wikimedia.org/T210411) [14:00:22] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530149 (owner: 10Reedy) [14:01:01] (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530149 (owner: 10Reedy) [14:01:54] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 04s) [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:11] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for webserver-misc-apps [labs/private] - 10https://gerrit.wikimedia.org/r/530136 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:09:17] !log rolling back nginx upgrade in cp3032 [14:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:45] 10Operations, 10Elasticsearch, 10Traffic, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10Mathew.onipe) This issue is solved 
for now and cloudelastic checks for all ports have bee... [14:13:32] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Icinga reports read time out error for some checks on cloudelastic cluster - https://phabricator.wikimedia.org/T230366 (10Mathew.onipe) [14:15:37] (03CR) 10Phamhi: [C: 03+2] Move cloudvirt1021, 1022 and 1023 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/530094 (https://phabricator.wikimedia.org/T229873) (owner: 10Andrew Bogott) [14:16:19] PROBLEM - DPKG on cp3032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:17:41] !log upgrading envoy package to 1.11.1 [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:22] (03PS2) 10Phamhi: Move cloudvirt1021, 1022 and 1023 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/530094 (https://phabricator.wikimedia.org/T229873) (owner: 10Andrew Bogott) [14:23:23] (03PS1) 10Jbond: u2f: add FIDO U2F [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530156 [14:24:03] (03PS1) 10Fsero: envoy: bump to 1.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/530157 [14:24:34] (03PS1) 10Eevans: sessionstore: Add flag to disable TLS support [deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) [14:24:58] (03CR) 10Jbond: [V: 03+2 C: 03+2] u2f: add FIDO U2F [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530156 (owner: 10Jbond) [14:26:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: bump to 1.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/530157 (owner: 10Fsero) [14:27:18] (03CR) 10Fsero: [V: 03+2 C: 03+2] envoy: bump to 1.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/530157 (owner: 10Fsero) [14:27:37] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.18e+05 gt 1e+05 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:27:54] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Icinga reports read time out error for some checks on cloudelastic cluster - https://phabricator.wikimedia.org/T230366 (10Mathew.onipe) After some conversation with @EBernhardson, it was discovered dump are currently being loaded into the clou... [14:38:47] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 7 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:40:06] (03CR) 10Will Doran: "I currently only have +1 :/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [14:40:43] (03PS5) 10Mathew.onipe: cloudelastic: fix monitored ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/529362 (https://phabricator.wikimedia.org/T229621) (owner: 10Jbond) [14:40:45] (03PS4) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [14:40:47] (03PS2) 10Mathew.onipe: icinga: add timeout option to elastic checks [puppet] - 10https://gerrit.wikimedia.org/r/529806 (https://phabricator.wikimedia.org/T230366) [14:41:00] (03CR) 10Mathew.onipe: icinga: add timeout option to elastic checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529806 (https://phabricator.wikimedia.org/T230366) (owner: 10Mathew.onipe) [14:43:33] RECOVERY - DPKG on cp3032 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:45:54] 
10Operations, 10Maps: Change maps codfw replication factor for v4 keyspace - https://phabricator.wikimedia.org/T226161 (10Mathew.onipe) 05Open→03Resolved [14:46:01] (03CR) 10Krinkle: sessionstore: Add flag to disable TLS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [14:48:43] 10Operations, 10Maps: Maps2004 ran into disk space issues again after reimaging with new partitioning scheme - https://phabricator.wikimedia.org/T224874 (10Mathew.onipe) 05Open→03Resolved This was traced to some initial problems during osm-initial-script. This was resolved by reinitializing osm again. [14:48:47] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mathew.onipe) [14:49:02] 10Operations, 10Maps, 10SRE-tools, 10User-Joe, 10User-jijiki: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (10Mathew.onipe) 05Open→03Resolved [14:49:04] 10Operations, 10SRE-tools, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10Mathew.onipe) [14:53:31] (03PS1) 10Cmjohnson: Adding mgmt dns entries for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530163 (https://phabricator.wikimedia.org/T227025) [14:54:57] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns entries for an-conf100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530163 (https://phabricator.wikimedia.org/T227025) (owner: 10Cmjohnson) [14:55:09] !log upgrade nginx to 1.13.9-1wm2 in cp3032 [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:01] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson) [14:56:26] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - 
https://phabricator.wikimedia.org/T227025 (10Cmjohnson) +an-conf1001 1H IN A 10.65.5.118 +an-conf1002 1H IN A 10.65.5.119 +an-conf1003 1H IN A 10.65.5.120 [14:56:43] (03PS1) 10Jhedden: openstack: add glance image sync to codfw [puppet] - 10https://gerrit.wikimedia.org/r/530164 (https://phabricator.wikimedia.org/T223907) [14:57:50] (03PS2) 10Ema: Add TLS termination for webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/530151 (https://phabricator.wikimedia.org/T210411) [14:57:52] (03PS1) 10Ema: envoy: configure tls_terminator if ensure: present [puppet] - 10https://gerrit.wikimedia.org/r/530165 [14:58:53] (03CR) 10Paladox: [C: 03+1] envoy: configure tls_terminator if ensure: present [puppet] - 10https://gerrit.wikimedia.org/r/530165 (owner: 10Ema) [14:59:09] (03CR) 10Jhedden: [C: 03+2] openstack: add glance image sync to codfw [puppet] - 10https://gerrit.wikimedia.org/r/530164 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:00:52] (03PS3) 10Ema: Add TLS termination for webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/530151 (https://phabricator.wikimedia.org/T210411) [15:01:54] (03CR) 10Ema: [C: 03+2] Add TLS termination for webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/530151 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [15:04:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: configure tls_terminator if ensure: present [puppet] - 10https://gerrit.wikimedia.org/r/530165 (owner: 10Ema) [15:04:55] (03PS2) 10Ema: envoy: configure tls_terminator if ensure: present [puppet] - 10https://gerrit.wikimedia.org/r/530165 [15:05:22] (03CR) 10Eevans: sessionstore: Add flag to disable TLS support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [15:05:55] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - 
https://phabricator.wikimedia.org/T224188 (10Cmjohnson) a:05RobH→03Jclark-ctr @Jclark-ctr can you add asset tags and enter these servers into Netbox (T221698 is the procurement task)... [15:06:03] (03CR) 10Ema: [C: 03+2] envoy: configure tls_terminator if ensure: present [puppet] - 10https://gerrit.wikimedia.org/r/530165 (owner: 10Ema) [15:06:39] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) a:05RobH→03Jclark-ctr @Jclark-ctr can you add asset tags and enter these servers into Netbox (T222916 is the procurement task).... [15:07:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10RobH) Please do not assign this to me, it is awaiting installation by DC ops into 10G racks, and not on me. This should be processed by the... [15:08:27] (03CR) 10Eevans: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [15:10:51] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10BBlack) As noted in T155359 - WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois... [15:15:47] (03CR) 10BPirkle: [C: 03+1] "I also only have the ability to +1, but (especially given that reversion is explicitly called out in the task) this looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [15:16:27] (03PS3) 10Gehel: icinga: add timeout option to elastic checks [puppet] - 10https://gerrit.wikimedia.org/r/529806 (https://phabricator.wikimedia.org/T230366) (owner: 10Mathew.onipe) [15:18:43] (03CR) 10Gehel: [C: 03+2] icinga: add timeout option to elastic checks [puppet] - 10https://gerrit.wikimedia.org/r/529806 (https://phabricator.wikimedia.org/T230366) (owner: 10Mathew.onipe) [15:20:46] (03PS1) 10Ema: webserver_misc_apps: do not install envoy [puppet] - 10https://gerrit.wikimedia.org/r/530168 (https://phabricator.wikimedia.org/T210411) [15:20:56] onimisionipe: ^ merged, I'll let you keep an eye on it [15:21:13] alright. Thanks! [15:22:28] (03CR) 10Ema: [C: 03+2] webserver_misc_apps: do not install envoy [puppet] - 10https://gerrit.wikimedia.org/r/530168 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [15:23:12] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: hw troubleshooting: power supply for db1129 - https://phabricator.wikimedia.org/T230458 (10Jclark-ctr) 05Open→03Resolved Found Power cable not fully seated . Reseated cable. [15:24:06] RECOVERY - IPMI Sensor Status on db1129 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:25:11] (03PS1) 10Jbond: cas: add google auth [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530169 [15:26:01] gilles: he gilles, could you at some point take a look at https://gerrit.wikimedia.org/r/#/c/operations/software/thumbor-plugins/+/524330/ ? 
[15:29:16] * Krinkle debugging on mwdebug1001 [15:29:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas: add google auth [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530169 (owner: 10Jbond) [15:30:36] (03PS1) 10Jbond: idp: add google authenticator [puppet] - 10https://gerrit.wikimedia.org/r/530170 [15:30:52] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [15:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.922e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:31:36] (03CR) 10jerkins-bot: [V: 04-1] idp: add google authenticator [puppet] - 10https://gerrit.wikimedia.org/r/530170 (owner: 10Jbond) [15:35:48] (03PS2) 10Jbond: idp: add google authenticator [puppet] - 10https://gerrit.wikimedia.org/r/530170 [15:37:44] PROBLEM - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:38] hmmm [15:38:59] (03CR) 10Jbond: [C: 03+2] idp: add google authenticator [puppet] - 10https://gerrit.wikimedia.org/r/530170 (owner: 10Jbond) [15:40:28] ^ failed network after reboot for elastic1017-1019, checking [15:40:41] and extending downtime [15:41:06] I'll leave you to it [15:41:56] !log powercycling elastic101[789] [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:56] (03PS1) 10Ema: Revert "ATS: enable compress plugin on cp5002" [puppet] - 10https://gerrit.wikimedia.org/r/530171 (https://phabricator.wikimedia.org/T227432) [15:44:49] (03PS2) 10Ema: Revert "ATS: enable compress plugin on cp5002" [puppet] - 10https://gerrit.wikimedia.org/r/530171 (https://phabricator.wikimedia.org/T227432) [15:45:27] !log gehel@cumin1001 START - 
Cookbook sre.elasticsearch.rolling-reboot [15:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:10] (CR) Ema: [C: +2] Revert "ATS: enable compress plugin on cp5002" [puppet] - https://gerrit.wikimedia.org/r/530171 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [15:48:52] (PS1) Jbond: idp: add gauth crypto keys [puppet] - https://gerrit.wikimedia.org/r/530172 [15:50:18] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (herron) >>! In T225005#5406176, @herron wrote: >>>! In T225005#5... [15:50:49] (PS2) Jbond: idp: add gauth crypto keys [puppet] - https://gerrit.wikimedia.org/r/530172 [15:50:58] !log cp5002: ats-backend-restart to disable compress plugin while I'm not around [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:21] (CR) Jbond: [C: +2] idp: add gauth crypto keys [puppet] - https://gerrit.wikimedia.org/r/530172 (owner: Jbond) [15:52:43] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Andrew is on holidays, but it looks good to me! [15:57:56] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (herron) >>! In T225005#5414274, @elukey wrote: > Andrew is on ho... [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190814T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:11:41] (PS3) BBlack: anycast recdns: config for many eqiad canaries [puppet] - https://gerrit.wikimedia.org/r/528524 (https://phabricator.wikimedia.org/T228190) [16:11:43] (PS4) BBlack: anycast recdns: enable for codfw clients [puppet] - https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) [16:11:45] (PS3) BBlack: anycast recdns: enable globally [puppet] - https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [16:11:54] (CR) Herron: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: Filippo Giunchedi) [16:21:09] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (herron) >>! In T225005#5414291, @herron wrote: >>>! In T225005#5... [16:22:31] Operations, Traffic, Patch-For-Review: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 (BBlack) I'm not sure if it goes as a subtask here, or of T167841 and/or T227808 - but recording here so we don't forget, from an earlier IRC conversation: As things stand, if e... 
[16:24:03] (PS1) Jhedden: openstack: add core filter to nova scheduler [puppet] - https://gerrit.wikimedia.org/r/530175 [16:25:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 8300 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:26:12] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Completely ignorant about it, I'd loop in @jijiki :) [16:30:38] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1023.mgmt.eqiad.... [16:30:41] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1023.mgmt.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudv... [16:31:04] Operations, Analytics, Core Platform Team Legacy (Watching / External), Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (Pchelolo) @herron I believe this is the documentation for it htt... 
[16:31:16] (CR) Jhedden: "Ran through puppet compiler and it's not picking up `profile::openstack::base::nova::scheduler_filters` defined in `hieradata/common/prof" [puppet] - https://gerrit.wikimedia.org/r/530175 (owner: Jhedden) [16:31:26] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [16:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:11] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:32:14] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:10] (PS2) Jhedden: openstack: add core filter to nova scheduler [puppet] - https://gerrit.wikimedia.org/r/530175 [16:42:02] (PS3) Jhedden: openstack: add core filter to nova scheduler [puppet] - https://gerrit.wikimedia.org/r/530175 [16:42:27] (CR) Jhedden: "puppet compiler results look good: https://puppet-compiler.wmflabs.org/compiler1001/17898/" [puppet] - https://gerrit.wikimedia.org/r/530175 (owner: Jhedden) [16:45:48] (PS4) Jhedden: openstack: add core filter to nova scheduler [puppet] - https://gerrit.wikimedia.org/r/530175 [16:46:43] (PS5) Jhedden: openstack: add core filter to nova scheduler [puppet] - https://gerrit.wikimedia.org/r/530175 [16:52:01] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (RobH) a:RobH→None [16:52:29] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (RobH) a:RobH→None [16:52:37] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227133 (RobH) a:RobH→None [16:52:46] Operations, ops-eqiad, DC-Ops: a2-eqiad pdu refresh - 
https://phabricator.wikimedia.org/T227138 (RobH) a:RobH→None [16:53:18] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (RobH) a:RobH→None [16:53:24] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (RobH) a:RobH→None [16:53:45] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (RobH) a:RobH→None [16:59:55] (CR) SBassett: "AndyRussG: Ejegg: Is there a separate privacy policy for this and/or user warning within CN? And are there some specific pages04.net URLs" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [17:07:31] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1023.mgmt.eqiad.... [17:08:56] * Krinkle scap pulls to reset mwdebug1001 and then goes to edit its files [17:16:07] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (Phamhi) I managed to bypass that issue by running ` sudo wmf-auto-reimage-host --no-verify -p T229871 cloudvirt1023.mgmt.eqiad.w... 
[17:29:41] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.rolling-reboot [17:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:14] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.232e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [17:56:34] (PS1) Bstorm: toolforge: rebranding k8s control plane to control [puppet] - https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009) [17:57:49] * Krinkle resetting mwdebug1001 - no longer used [17:59:12] (PS1) Jforrester: DNM: Disable the videojs TMH beta feature due to resource issues [mediawiki-config] - https://gerrit.wikimedia.org/r/530187 [18:00:44] (CR) Bstorm: "Obviously, I'll want to update the prefix puppet in toolsbeta for this after this is merged." [puppet] - https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009) (owner: Bstorm) [18:01:15] (CR) Bstorm: "Note that my local copy of git recognized that the role change is a rename. Apparently this version of git does not." 
[puppet] - https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009) (owner: Bstorm) [18:01:40] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [18:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:44] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 9048 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:07:06] ebernhardson: gehel: fwiw, seems nlwiki index still stale and not getting closer to being up to date [18:07:14] (5+ hours now) [18:08:17] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1023.mgmt.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudv... 
[18:08:56] (PS1) Bstorm: dumpsdistribution: set primary web to the server marked web [dns] - https://gerrit.wikimedia.org/r/530188 (https://phabricator.wikimedia.org/T217473) [18:09:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:09:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:12:23] (PS1) Bstorm: dumpsdistribution: fail back to labstore1006 properly [puppet] - https://gerrit.wikimedia.org/r/530189 (https://phabricator.wikimedia.org/T217473) [18:18:30] (CR) Jhedden: [C: +1] dumpsdistribution: fail back to labstore1006 properly [puppet] - https://gerrit.wikimedia.org/r/530189 (https://phabricator.wikimedia.org/T217473) (owner: Bstorm) [18:19:12] (CR) Jhedden: [C: +1] dumpsdistribution: set primary web to the server marked web [dns] - https://gerrit.wikimedia.org/r/530188 (https://phabricator.wikimedia.org/T217473) (owner: Bstorm) [18:25:15] (PS1) MarcoAurelio: Add .gitreview [debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/530192 [18:25:29] (CR) MarcoAurelio: [V: +2 C: +2] Add .gitreview [debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/530192 (owner: MarcoAurelio) [18:25:37] (PS1) Jbond: html: attempt to update the header [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530193 [18:26:22] (CR) Jbond: [V: +2 C: +2] html: attempt to update the header [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530193 (owner: Jbond) [18:30:13] (PS1) Jbond: html: try to fix the css [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530195 [18:31:13] (CR) 
Jbond: [V: +2 C: +2] html: try to fix the css [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530195 (owner: Jbond) [18:32:20] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:50] (PS1) Jbond: html: remove links and copyright [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530196 [18:38:23] (CR) Jbond: [V: +2 C: +2] html: remove links and copyright [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530196 (owner: Jbond) [18:39:41] Krinkle: I talked with Erik, this is actually a known issue. We have about 7M documents of backlog at the moment and chewing through them as we can [18:41:08] Operations, Toolforge, Tools, Mobile, User-RhinosF1: Zoom on tools.wmflabs.org gets stuck - https://phabricator.wikimedia.org/T230508 (RhinosF1) [18:41:24] (CR) AndyRussG: "> Patch Set 5:" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [18:41:57] Operations, Toolforge, Tools, Mobile, User-RhinosF1: Zoom on tools.wmflabs.org gets stuck - https://phabricator.wikimedia.org/T230508 (RhinosF1) Tagging with #user-rhinosf1 to remind me to add proper browser info later [18:43:47] (PS1) Jbond: html: try to completly remove footers and headers [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530197 [18:44:16] (CR) Jbond: [V: +2 C: +2] html: try to completly remove footers and headers [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530197 (owner: Jbond) [18:45:30] Operations, Toolforge, Tools, Mobile: Zoom on tools.wmflabs.org gets stuck - https://phabricator.wikimedia.org/T230508 (RhinosF1) [18:47:10] (CR) AndyRussG: "Just another note: Even if our general conclusion has been that calls from the client to this external domain are OK, 
we don't think it's " [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [18:49:07] (PS1) Jbond: html: add empty content [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530198 [18:49:30] (CR) Jbond: [V: +2 C: +2] html: add empty content [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530198 (owner: Jbond) [19:03:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 60 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:08:46] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:10:35] Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (ayounsi) Circuit is down again, opened ticket 16915334. Account rep replied to the thread and put their client support manager in the loop as well. [19:20:18] gehel: thanks, good to know [19:31:00] (PS1) Herron: prometheus-ipsec-exporter: initial commit of version 0.3.1 [debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) [19:34:24] (PS1) Jbond: html: ad public workstation [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530204 [19:34:45] (CR) Jbond: [V: +2 C: +2] html: ad public workstation [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/530204 (owner: Jbond) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190814T2000). [20:00:38] Operations, Performance-Team, TechCom-RFC, Traffic, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Ladsgroup) We talked about this with @Tgr in the hackathon and one easy way to bypass the issue of the redirect loop is to serve the ma... [20:08:07] (CR) Ayounsi: "That's a big one!" (20 comments) [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov) [20:41:37] (CR) Eevans: [V: +2 C: +2] "> Patch Set 1: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/530158 (https://phabricator.wikimedia.org/T229697) (owner: Eevans) [20:44:17] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [20:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:55] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [20:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:11] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [20:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:32] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . 
[20:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:37] (CR) Bstorm: [C: +2] dumpsdistribution: fail back to labstore1006 properly [puppet] - https://gerrit.wikimedia.org/r/530189 (https://phabricator.wikimedia.org/T217473) (owner: Bstorm) [20:57:54] (PS1) Eevans: Revert "sessionstore: Add flag to disable TLS support" [deployment-charts] - https://gerrit.wikimedia.org/r/530216 (https://phabricator.wikimedia.org/T229697) [20:58:16] (CR) Bstorm: [C: +2] dumpsdistribution: set primary web to the server marked web [dns] - https://gerrit.wikimedia.org/r/530188 (https://phabricator.wikimedia.org/T217473) (owner: Bstorm) [20:59:15] (CR) Eevans: "I give up." [deployment-charts] - https://gerrit.wikimedia.org/r/530216 (https://phabricator.wikimedia.org/T229697) (owner: Eevans) [21:04:55] (CR) Eevans: [V: +2 C: +2] Revert "sessionstore: Add flag to disable TLS support" [deployment-charts] - https://gerrit.wikimedia.org/r/530216 (https://phabricator.wikimedia.org/T229697) (owner: Eevans) [21:13:00] Operations, netops: ospf link-protection - https://phabricator.wikimedia.org/T167306 (ayounsi) [21:13:03] !log apply freeze to cloudelastic writes, to determine if backlog processing can catchup while deferring cloudelastic writes [21:13:03] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi) [21:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:11] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi) [21:15:59] (PS1) Eevans: Deploy 2019-08-14-210839-production Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/530220 (https://phabricator.wikimedia.org/T229697) [21:17:32] (CR) Eevans: [V: +2 C: +2] Deploy 2019-08-14-210839-production Docker image [deployment-charts] - 
https://gerrit.wikimedia.org/r/530220 (https://phabricator.wikimedia.org/T229697) (owner: Eevans) [21:18:46] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [21:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:13] (PS1) CDanis: admin: add clarakosi to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/530221 (https://phabricator.wikimedia.org/T230204) [21:23:30] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi) Sounds good, final version, including both AS 65002 and AS 65001 as optional to keep it generic. Tested the regex using `show route aspath-regex "^(65002|65001)? 64600.*"` Will push IPv6 fir... [21:24:50] (CR) CDanis: [C: +2] admin: add clarakosi to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/530221 (https://phabricator.wikimedia.org/T230204) (owner: CDanis) [21:26:18] (PS1) Eevans: sessionstore: (Temporarily )use HTTP for liveness [deployment-charts] - https://gerrit.wikimedia.org/r/530223 (https://phabricator.wikimedia.org/T229697) [21:26:29] !log thaw writes to cloudelastic [21:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:34] cdanis: https://wikitech.wikimedia.org/wiki/Ops_Onboarding is what I found (MA) [21:27:43] but it's for 'ops' [21:28:09] and it looks someone else would be doing the process, not the new employee himself [21:29:01] so I guess for Clara she needs to be added to ldap_only_users, sudo puppet-merge and sudo ldap-modify-user clarakosi wmf [21:29:07] (CR) Eevans: [V: +2 C: +2] sessionstore: (Temporarily )use HTTP for liveness [deployment-charts] - https://gerrit.wikimedia.org/r/530223 (https://phabricator.wikimedia.org/T229697) (owner: Eevans) [21:29:13] but I'm not quite sure about the whole process [21:30:26] !log @ helmfile [STAGING] Ran 'apply' command on namespace 
'sessionstore' for release 'staging' . [21:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:50] (PS1) CDanis: admin: add annet to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/530227 (https://phabricator.wikimedia.org/T229963) [21:33:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:35:02] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:35:11] (CR) CDanis: [C: +2] admin: add annet to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/530227 (https://phabricator.wikimedia.org/T229963) (owner: CDanis) [21:37:53] !log advertise core v6 range (2620:0:860::/46) from eqord - T167841 [21:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:01] T167841: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 [21:39:30] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:40:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:45:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:46:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:46:52] (PS1) MarcoAurelio: openldap::offboard-user.py: WMF_FR renamed to acl*WMF-FR [puppet] - https://gerrit.wikimedia.org/r/530230 [21:49:48] XioNoX: same esams link flapping again today? it was off earlier too [21:50:59] Operations, Core Platform Team Workboards (Green), Performance-Team (Radar): Grafana dashboards for sessionstore, k8s staging, are not working - https://phabricator.wikimedia.org/T230515 (Eevans) [21:51:55] bblack: yeah I updated https://phabricator.wikimedia.org/T228827 and have a ticket open with them, seems like a fibercut and crew (kinda?) fixed it, also have an email thread with Level3 account rep about the recuring outages, I can CC you next time I reply [21:53:03] see the emails to maint-announce for the current outage fix [21:57:02] thanks! 
[21:57:10] Operations, Puppet: offobard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (MarcoAurelio) [21:57:25] Operations, Puppet: offobard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (MarcoAurelio) [21:59:30] Operations, Puppet: offobard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (mmodell) Yes PHIDs would work and indeed, PHIDs are likely to be more stable over time. [22:00:57] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (CDanis) @abi_ Can you clarify if you need access to private (user webrequest logs) data? [22:01:07] !log starting scs-ulsfo replacement. There will be icinga errors and they are intentionally being allowed so we know when things dont recover properly T230077 [22:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:18] T230077: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 [22:01:19] ie rob wants to knowwwwww [22:04:34] Operations, Puppet: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (Krinkle) [22:08:08] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:19:43] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): elastic1017 lost network after reboot - https://phabricator.wikimedia.org/T230518 (Gehel) [22:23:20] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout 
SECONDS] https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [22:28:59] Operations, Puppet: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (MarcoAurelio) Also `#acl*wmf-siem-policy-admins` was renamed to #acl_security_team and that was not changed in the offboarding script (I'll amend r530230 to fix t... [22:31:36] (PS2) MarcoAurelio: openldap::offboard-user.py: Adjust several renamed projects [puppet] - https://gerrit.wikimedia.org/r/530230 [22:38:08] !log freeze writes to cloudelastic for real this time [22:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:08] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for tes [22:39:08] he unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [22:40:12] (PS3) MarcoAurelio: openldap::offboard-user.py: Adjust several renamed projects [puppet] - https://gerrit.wikimedia.org/r/530230 [22:41:39] Operations, Puppet: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (MarcoAurelio) Same with `acl*operations-team` -> `acl*sre-team`: https://phabricator.wikimedia.org/feed/6590383927931223643/ [22:44:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:45:17] Operations, Puppet: offboard-user.py: do not hardcode 
Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (10MarcoAurelio) I wonder if the script did at least detected that the project was renamed and removed the users from said sensitive projects afterwards. Usually the... [22:45:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:48:48] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [23:00:00] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 3.106e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [23:00:05] MaxSem, RoanKattouw, and Niharika: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190814T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:07:07] (03CR) 10Cwhite: [C: 03+1] "Looks good!" 
[debs/prometheus-ipsec-exporter] - https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: Herron) [23:08:36] (CR) Cwhite: [C: +1] icinga: add acknowledge details to emails [puppet] - https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: Filippo Giunchedi) [23:13:01] !log leave cloudelastic writes paused, and dropping from backlog queue, to allow primary clusters to catch up [23:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:53] (CR) Cwhite: [C: +1] mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi) [23:40:46] (CR) Cwhite: [C: +1] swift: stop monitoring individual daemons [puppet] - https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi) [23:44:09] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): elastic1017 lost network after reboot - https://phabricator.wikimedia.org/T230518 (wiki_willy) a:Cmjohnson