[00:32:36] (PS1) Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[00:32:43] !log About to begin wdqs deploy; before-deploy tests on canary `wdqs1003` are passing
[00:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:49] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@03219df]: 0.3.55
[00:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:26] !log Following deploy to canary `wdqs1003`, automated tests are passing as is a manual test of an example query. Proceeding...
[00:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:00] !log T222669 [Elasticsearch reindex] Began long-running reindex of cirrus elasticsearch for `codfw`, `eqiad`, and `cloudelastic`. 3 tmux sessions on `ryankemper@mwmaint1002`: `reindex_eqiad`, `reindex_codfw`, `reindex_cloudelastic`
[00:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:07] T222669: Normalize homoglyphs in mixed-script tokens when possible - https://phabricator.wikimedia.org/T222669
[00:46:13] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@03219df]: 0.3.55 (duration: 11m 24s)
[00:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:00] !log [wdqs deploy] following deploy, example query succeeds on `query.wikidata.org`, proceeding to post deploy steps
[00:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:30] !log Restarted `wdqs-updater` simultaneously across all wdqs hosts: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[00:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:52] !log Restarted `wdqs-categories` across wdqs test hosts: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[00:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:12] !log Restarting `wdqs-categories` one host at a time across all wdqs production instances: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'`
[00:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:35] (CR) Krinkle: [C: +1] arclamp: Use Python 3 [puppet] - https://gerrit.wikimedia.org/r/639885 (https://phabricator.wikimedia.org/T267269) (owner: Dave Pifke)
[01:20:53] (CR) Krinkle: [C: +1] webperf: change navtiming to use Python 3 [puppet] - https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) (owner: Dave Pifke)
[01:33:10] Operations, DBA, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle) @Marostegui Should we assume that it has already been determined that there is no significant benefit today to the (other) query groups? Or...
[02:06:36] Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288 (Krinkle) Open→Declined jobchron mediawiki/services/jobrunner are no longer used in production.
[02:18:14] !log (WDQS deploy completed)
[02:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:49] (PS1) Krinkle: ProductionServices: Document hostname of redis_lock hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/640576 (https://phabricator.wikimedia.org/T267581)
[07:22:38] (CR) Effie Mouzeli: [C: +1] "that helps, thank you!" [mediawiki-config] - https://gerrit.wikimedia.org/r/640576 (https://phabricator.wikimedia.org/T267581) (owner: Krinkle)
[07:40:41] Operations, ops-codfw, netops: ripe-atlast-codfw is down - https://phabricator.wikimedia.org/T267714 (elukey)
[07:40:58] Operations, ops-codfw, netops: ripe-atlast-codfw is down - https://phabricator.wikimedia.org/T267714 (elukey) p:Triage→High
[07:41:37] ACKNOWLEDGEMENT - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 566 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map Elukey T267714 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:41:37] ACKNOWLEDGEMENT - Host ripe-atlas-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:201:208:80:152:244) Elukey T267714
[07:41:37] ACKNOWLEDGEMENT - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 651 probes of 652 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map Elukey T267714 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:41:37] ACKNOWLEDGEMENT - Host ripe-atlas-codfw is DOWN: PING CRITICAL - Packet loss = 100% Elukey T267714
[07:48:57] (PS1) Muehlenhoff: Extend MOU for Robert West [puppet] - https://gerrit.wikimedia.org/r/640662
[07:50:08] (PS2) Muehlenhoff: Extend MOU for Robert West [puppet] - https://gerrit.wikimedia.org/r/640662
[07:52:45] (CR) Muehlenhoff: [C: +2] Extend MOU for Robert West [puppet] - https://gerrit.wikimedia.org/r/640662 (owner: Muehlenhoff)
[08:25:32] (CR) Ayounsi: Add CSV import to ProvisionServerNetwork script (7 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[08:33:36] (PS17) Ayounsi: Add CSV import to ProvisionServerNetwork script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[08:33:38] (PS18) Ayounsi: ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[08:33:40] (PS1) Ayounsi: Add python 3.8 to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/640664
[08:34:11] (CR) jerkins-bot: [V: -1] ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: Ayounsi)
[08:34:13] (CR) jerkins-bot: [V: -1] Add python 3.8 to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/640664 (owner: Ayounsi)
[08:34:15] (CR) jerkins-bot: [V: -1] Add CSV import to ProvisionServerNetwork script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[08:37:10] (PS18) Ayounsi: Add CSV import to ProvisionServerNetwork script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[08:37:10] (PS19) Ayounsi: ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[08:37:12] (PS2) Ayounsi: Add python 3.8 to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/640664
[09:15:42] Operations, netops: Prevent advertising invalid prefixes from customers - https://phabricator.wikimedia.org/T267719 (ayounsi) p:Triage→High
[09:17:47] (PS1) Ayounsi: Drop special-ranges in BGP_outfilter [homer/public] - https://gerrit.wikimedia.org/r/640666 (https://phabricator.wikimedia.org/T267719)
[09:53:34] !log prioritized DE-CIX IXP - T262681
[09:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:42] (PS1) Ayounsi: Prioritize DE-CIX Dallas IXP [homer/public] - https://gerrit.wikimedia.org/r/640671 (https://phabricator.wikimedia.org/T262681)
[09:55:27] (CR) Ayounsi: [C: +2] Prioritize DE-CIX Dallas IXP [homer/public] - https://gerrit.wikimedia.org/r/640671 (https://phabricator.wikimedia.org/T262681) (owner: Ayounsi)
[09:55:53] (Merged) jenkins-bot: Prioritize DE-CIX Dallas IXP [homer/public] - https://gerrit.wikimedia.org/r/640671 (https://phabricator.wikimedia.org/T262681) (owner: Ayounsi)
[10:03:39] (PS1) Bartosz Dziewoński: Fix getHeadlineNodeAndOffset() returning text nodes [extensions/DiscussionTools] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/640497 (https://phabricator.wikimedia.org/T267284)
[10:05:17] (PS1) Ayounsi: Add HE to DE-CIX Dallas [homer/public] - https://gerrit.wikimedia.org/r/640672
[10:05:57] (PS2) Ayounsi: Add HE to DE-CIX Dallas [homer/public] - https://gerrit.wikimedia.org/r/640672
[10:06:39] (CR) Ayounsi: [C: +2] Add HE to DE-CIX Dallas [homer/public] - https://gerrit.wikimedia.org/r/640672 (owner: Ayounsi)
[10:07:08] (Merged) jenkins-bot: Add HE to DE-CIX Dallas [homer/public] - https://gerrit.wikimedia.org/r/640672 (owner: Ayounsi)
[10:13:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:27] <_joe_> XioNoX: I guess that's a consequence of your change
[10:27:55] ah yep, waiting on HE to configure their side
[10:28:36] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE ayounsi Waiting for HE to configure their side https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:50] thx
[10:34:33] !log delete unused interfaces from asw-d-codfw
[10:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:43] (CR) Jbond: "thanks <3" [puppet] - https://gerrit.wikimedia.org/r/640512 (https://phabricator.wikimedia.org/T267396) (owner: Andrew Bogott)
[11:24:25] (PS1) Lucas Werkmeister (WMDE): Remove propagateChangeVisibility repo setting [mediawiki-config] - https://gerrit.wikimedia.org/r/640676
[11:26:15] jouncebot: refresh please
[11:26:16] I refreshed my knowledge about deployments.
[11:45:39] Operations, SRE-Access-Requests: Requesting access to deployment for jgiannelos - https://phabricator.wikimedia.org/T267585 (jijiki) Open→Resolved a:jijiki @Jgiannelos all done, please reopen or find me on irc if something is not right.
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201111T1200).
[12:00:04] MatmaRex and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:26] Lucas_WMDE: I can deploy, or you can 🙂
[12:00:31] I can do it :)
[12:01:04] hi
[12:01:09] jouncebot: now
[12:01:09] For the next 0 hour(s) and 58 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201111T1200)
[12:01:10] Lucas_WMDE: okay, leaving to you :-)
[12:02:13] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix getHeadlineNodeAndOffset() returning text nodes [extensions/DiscussionTools] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/640497 (https://phabricator.wikimedia.org/T267284) (owner: Bartosz Dziewoński)
[12:07:23] (Merged) jenkins-bot: Fix getHeadlineNodeAndOffset() returning text nodes [extensions/DiscussionTools] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/640497 (https://phabricator.wikimedia.org/T267284) (owner: Bartosz Dziewoński)
[12:08:26] MatmaRex: the change should be on mwdebug1001 now, can you test it?
[12:08:35] yeah. looking
[12:09:28] looks good now
[12:09:35] Lucas_WMDE: ^
[12:09:40] ok!
[12:11:25] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.16/extensions/DiscussionTools/includes/CommentParser.php: Backport: [[gerrit:640497|Fix getHeadlineNodeAndOffset() returning text nodes (T267284)]] (duration: 01m 01s)
[12:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:33] T267284: DiscussionTools fails on beta enwiki `Wikipedia:Quick_directory` - https://phabricator.wikimedia.org/T267284
[12:12:13] (PS2) Lucas Werkmeister (WMDE): Enable propagatePageDeletion on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/636453
[12:12:31] (CR) Lucas Werkmeister (WMDE): [C: +2] Enable propagatePageDeletion on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/636453 (owner: Lucas Werkmeister (WMDE))
[12:13:25] (Merged) jenkins-bot: Enable propagatePageDeletion on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/636453 (owner: Lucas Werkmeister (WMDE))
[12:13:58] I can’t really test this change, I’ll just quickly check that mwdebug doesn’t explode
[12:14:59] (thanks for deploying)
[12:15:17] np :)
[12:16:26] (PS2) Lucas Werkmeister (WMDE): Remove propagateChangeVisibility repo setting [mediawiki-config] - https://gerrit.wikimedia.org/r/640676
[12:16:57] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:636453|Enable propagatePageDeletion on Wikidata]] (duration: 00m 59s)
[12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:21] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove propagateChangeVisibility repo setting [mediawiki-config] - https://gerrit.wikimedia.org/r/640676 (owner: Lucas Werkmeister (WMDE))
[12:20:12] (Merged) jenkins-bot: Remove propagateChangeVisibility repo setting [mediawiki-config] - https://gerrit.wikimedia.org/r/640676 (owner: Lucas Werkmeister (WMDE))
[12:20:53] and this change should be a no-op, checking on mwdebug that nothing obvious breaks
[12:21:39] looks like everything’s working
[12:23:12] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:640676|Remove propagateChangeVisibility repo setting]] (duration: 00m 58s)
[12:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:30] I think that’s it
[12:24:35] any last-minute changes? :)
[12:25:41] !log EU backport&config window done
[12:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:26] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 113170960 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:55:06] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4936 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:58:56] Operations, Commons, MediaWiki-File-management: File from commons is not loaded properly - https://phabricator.wikimedia.org/T267668 (CptViraj)
[13:03:43] silvia
[13:06:05] df -h
[13:45:44] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[13:48:11] Operations, Desktop Improvements, Product-Infrastructure-Team-Backlog, Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (akosiaris) A few more tests. the TL;DR says varnish 6 is at fault probably, but with a question mark. Test...
[13:48:14] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/640685
[13:49:55] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/640686
[13:51:19] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/640687
[13:52:46] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=cp3054.esams.wmnet
[13:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:57] (PS1) Jbond: puppet: migrate from require_package to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479)
[14:18:30] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/640666 (https://phabricator.wikimedia.org/T267719) (owner: Ayounsi)
[14:19:52] (CR) jerkins-bot: [V: -1] puppet: migrate from require_package to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: Jbond)
[14:26:18] (PS16) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[14:27:03] (PS2) Jbond: puppet: migrate from require_package to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479)
[14:27:09] (CR) jerkins-bot: [V: -1] puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: Jbond)
[14:28:44] (PS17) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[14:28:54] (CR) jerkins-bot: [V: -1] puppet: migrate from require_package to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: Jbond)
[14:29:39] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=cp3054.esams.wmnet
[14:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:43] (PS3) Jbond: puppet: migrate from require_package to ensure_packages [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479)
[14:33:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:48] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:18] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:36:02] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:39:32] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.28 ms
[14:43:39] (PS1) Jbond: test puppet merge [puppet] - https://gerrit.wikimedia.org/r/640689
[14:44:32] (CR) Jbond: "PCC (still running): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26422/console" [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: Jbond)
[14:44:42] (PS2) Jbond: test puppet merge [puppet] - https://gerrit.wikimedia.org/r/640689
[14:44:50] (PS3) Jbond: test puppet merge [puppet] - https://gerrit.wikimedia.org/r/640689
[14:45:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:05] (CR) Jbond: [C: +2] test puppet merge [puppet] - https://gerrit.wikimedia.org/r/640689 (owner: Jbond)
[14:51:03] (PS18) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[14:52:49] (PS19) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[14:54:56] (PS1) Jbond: Revert "test puppet merge" [puppet] - https://gerrit.wikimedia.org/r/640499
[14:55:04] (CR) Jbond: [V: +2 C: +2] Revert "test puppet merge" [puppet] - https://gerrit.wikimedia.org/r/640499 (owner: Jbond)
[14:57:06] (CR) Jbond: "Ready for review again" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: Jbond)
[15:20:51] (CR) Faidon Liambotis: [C: +1] Drop special-ranges in BGP_outfilter [homer/public] - https://gerrit.wikimedia.org/r/640666 (https://phabricator.wikimedia.org/T267719) (owner: Ayounsi)
[15:21:40] (CR) Ayounsi: [C: +2] Drop special-ranges in BGP_outfilter [homer/public] - https://gerrit.wikimedia.org/r/640666 (https://phabricator.wikimedia.org/T267719) (owner: Ayounsi)
[15:22:07] (Merged) jenkins-bot: Drop special-ranges in BGP_outfilter [homer/public] - https://gerrit.wikimedia.org/r/640666 (https://phabricator.wikimedia.org/T267719) (owner: Ayounsi)
[15:26:22] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:27:26] Operations, netops, Patch-For-Review: Prevent advertising invalid prefixes from customers - https://phabricator.wikimedia.org/T267719 (ayounsi) Pushed to cr3-ulsfo: `name=before Prefix Nexthop MED Lclpref AS path * 172.16.0.0/21 Self I *...
[15:50:28] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:52:32] !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[15:52:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[15:52:32] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:54:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:14:07] Operations, LDAP-Access-Requests: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (tmletzko)
[16:21:00] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:30:09] !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[16:30:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:42] PROBLEM - MD RAID on ms-be2031 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:35:44] ACKNOWLEDGEMENT - MD RAID on ms-be2031 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267746 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:36:01] Operations, ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267746 (ops-monitoring-bot)
[16:38:05] !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[16:38:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:32] (PS1) Ayounsi: Revert "temporarily route Italy to codfw" [dns] - https://gerrit.wikimedia.org/r/640500
[16:41:51] (PS2) Ayounsi: Revert "temporarily route Italy to codfw" [dns] - https://gerrit.wikimedia.org/r/640500
[16:42:41] (CR) Ayounsi: [C: +2] Revert "temporarily route Italy to codfw" [dns] - https://gerrit.wikimedia.org/r/640500 (owner: Ayounsi)
[16:44:35] !log Revert "temporarily route Italy to codfw"
[16:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:13] Operations, netops: Prevent advertising invalid prefixes from customers - https://phabricator.wikimedia.org/T267719 (ayounsi) Open→Resolved Done.
[16:56:32] PROBLEM - HP RAID on ms-be2031 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:56:35] ACKNOWLEDGEMENT - HP RAID on ms-be2031 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:2 - Failed: 2I:4:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267748 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:56:39] Operations, ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (ops-monitoring-bot)
[17:02:29] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:04:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 56.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:07:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 73.63 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:16:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 45.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:19:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 72.24 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:22:11] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:26:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 37.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:26:37] PROBLEM - Device not healthy -SMART- on ms-be2031 is CRITICAL: cluster=swift device=None instance=ms-be2031 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2031&var-datasource=codfw+prometheus/ops
[17:28:53] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:31:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 77.31 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:32:09] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[17:53:23] (PS4) Jberkel: Enable "Cite" button in toolbar for enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/640087 (https://phabricator.wikimedia.org/T267504)
[18:34:35] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 65, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:00:04] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201111T1900)
[19:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201111T1900)
[19:00:04] No GERRIT patches in the queue for this window AFAICS.
[19:47:51] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 173 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:49:31] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:44:33] Operations, LDAP-Access-Requests: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (Aklapper) Open→Stalled @tmletzko: Hi, please see https://phabricator.wikimedia.org/project/profile/1564/ for required information.
[21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201111T2100).
[21:07:48] Operations, Domains, Traffic, Patch-For-Review: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (Asaf) Some more weeks on, I repeat the request to make progress, or at least offer an ETA for this. Thank you.
[21:25:11] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:25:27] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:27:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:28:31] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:58:06] (CR) Ladsgroup: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup)
[22:00:00] (CR) Ladsgroup: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup)
[22:00:40] (CR) Ladsgroup: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882) (owner: Ladsgroup)
[22:12:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:13:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:28:38] Guys, are the servers experiencing an overload or something. Wikipedia is intermittently having issues loading. This is only happening on Wikipedia, no where else on the Internet.
[22:29:29] No one else has complained
[22:29:33] What sort of issues loading?
[22:32:04] Reedy, requests to the server are not being answered.
[22:32:19] So I'm just sitting here with indefinite loading.
[22:32:30] Happening to all devices.
[22:32:39] But the rest of the internet is loading fine.
[22:33:33] You've tested it all?
[22:33:34] The problems appear to be intermittent, but are excacerbated when trying to get past the 2FA login.
[22:38:37] Reedy, now it's working again.
[22:50:51] we've had no other reports from users on irc, so hopefully it was somehow just you
[22:54:24] Maybe transient DNS issue somewhere along the way.
[23:26:56] Operations, Commons, MediaWiki-File-management: File from commons is not loaded properly - https://phabricator.wikimedia.org/T267668 (ColinFine) I don't know if this is helpful, but when I tried editing [[Allan Shivers]] on en-wiki, which is showing the problem, I noticed that the image appears in "S...
[23:32:27] Operations, Commons, MediaWiki-File-management: File from commons is not loaded properly - https://phabricator.wikimedia.org/T267668 (Urbanecm) p:Medium→High >>! In T267668#6617648, @jijiki wrote: > @AntiCompositeNumber the feature we have been working on has been enabled in mw1276 (api) and...
[23:34:19] Operations, Commons, MediaWiki-File-management, Wikimedia-production-error: File from commons is not loaded properly - https://phabricator.wikimedia.org/T267668 (jijiki)
[23:52:54] Operations, Commons, MediaWiki-File-management, Wikimedia-production-error: File from commons is not loaded properly - https://phabricator.wikimedia.org/T267668 (AntiCompositeNumber) >>! In T267668#6619716, @ColinFine wrote: > I don't know if this is helpful, but when I tried editing [[Allan Shiv...
[23:54:19] Operations, Commons, MediaWiki-File-management, Wikimedia-production-error: Some recent Commons uploads not available on other wikis (2020-11) - https://phabricator.wikimedia.org/T267668 (AntiCompositeNumber)