[00:12:44] Operations, Performance-Team, Wikidata, Wikidata-Query-Service, and 2 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) This query: ``` SELECT rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_t...
[00:24:28] Operations, Performance-Team, Wikidata, Wikidata-Query-Service, and 2 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Explains: codfw: {P7551} eqiad: {P7552}
[00:25:16] Operations, DBA, Performance-Team, Wikidata, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Looks like codfw one does not use index. @jcrespo do you have any idea why that could happen?
[00:32:56] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Krinkle) a:Krinkle>None
[00:36:43] Operations, wikitech.wikimedia.org, Documentation: Update wasat/mwmaint2001 docs on Wikitech - https://phabricator.wikimedia.org/T204389 (Smalyshev)
[00:52:26] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Reedy) What db host are you running it against? There are specific hosts for recent changes as per the db-*.php files ```...
[01:02:42] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) @Reedy I am not sure which host, I just logged in to maintenance host for eqiad and codfw. Lookups show db2082....
[01:04:52] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) I tried `db2085:3318` and the result the same as other codfw host. So if that's what actual API is using, that...
[01:08:58] !log depooled wdqs2003 to let it catch up
[01:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:36] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Reedy) >>! In T202764#4585276, @Smalyshev wrote: > I tried `db2085:3318` and the result the same as other codfw host. So i...
[02:22:24] Operations, Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (Quiddity)
[03:38:14] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Krinkle) Earlier when I was investigating with Stas, performing [the slow recentchanges api request](https://www.wikidata....
[04:53:53] !log re-pooled wdqs2003
[04:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[05:14:39] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [2000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1
[05:14:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[05:23:28] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1
[06:04:00] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) So taking a quick look at the schema differences between rc groups in different DCs, they look the same: The...
[06:28:19] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:37:53] (PS1) Revi: Enable FileExport on Korean Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460622 (https://phabricator.wikimedia.org/T204399)
[06:58:39] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:33:48] PROBLEM - High lag on wdqs2003 is CRITICAL: 3607 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:35:49] PROBLEM - High lag on wdqs2003 is CRITICAL: 3652 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:05:49] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:42:37] !log joal@deploy1001 Started deploy [analytics/refinery@f4d1f24]: Deploying cluster to prevent failures from missing misc data
[11:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:56] !log joal@deploy1001 Finished deploy [analytics/refinery@f4d1f24]: Deploying cluster to prevent failures from missing misc data (duration: 15m 19s)
[11:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:38] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:58:39] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:00:49] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:01:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:11:47] hi, any issue on Commons currently? i.e. https://commons.wikimedia.org/wiki/Commons:Village_pump#Problem_uploading_large_file_(~200_MB)
[12:12:01] also problem for me ^
[12:12:55] yannf: Maybe create a bug report about it?
[12:13:45] ok
[12:14:02] I will try again, and make a report if it fails again
[12:14:14] Great! Thank you for taking the time to do so
[12:15:47] (PS1) Elukey: profile::analytics::reginery::job::data_purge: fix cron job [puppet] - https://gerrit.wikimedia.org/r/460697
[12:16:51] (CR) Joal: "LGTM ! Thanks elukey :)" [puppet] - https://gerrit.wikimedia.org/r/460697 (owner: Elukey)
[12:16:55] (CR) Joal: [C: 1] profile::analytics::reginery::job::data_purge: fix cron job [puppet] - https://gerrit.wikimedia.org/r/460697 (owner: Elukey)
[12:18:29] (CR) Elukey: [C: 2] profile::analytics::reginery::job::data_purge: fix cron job [puppet] - https://gerrit.wikimedia.org/r/460697 (owner: Elukey)
[12:18:34] Operations, Puppet, Cloud-VPS, Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (hashar) Can we now delete the old instances and remove them from the Jenkins config? In Jenkins we just have to delete: * https://integration....
[12:22:01] (CR) Hashar: [C: 2] Add tox.ini [software/keyholder] - https://gerrit.wikimedia.org/r/460065 (owner: Thcipriani)
[12:22:48] (Merged) jenkins-bot: Add tox.ini [software/keyholder] - https://gerrit.wikimedia.org/r/460065 (owner: Thcipriani)
[12:23:16] (PS1) Hashar: Add .gitreview [software/keyholder] - https://gerrit.wikimedia.org/r/460698
[12:28:45] (CR) jerkins-bot: [V: -1] Stop spawning ssh-keygen but generate fps ourselves [software/keyholder] - https://gerrit.wikimedia.org/r/458249 (owner: Faidon Liambotis)
[12:28:52] (CR) jerkins-bot: [V: -1] Unlink the Unix domain socket when exiting [software/keyholder] - https://gerrit.wikimedia.org/r/458247 (owner: Faidon Liambotis)
[12:29:06] (CR) jerkins-bot: [V: -1] Add compatibility with Construct 2.8.22 and 2.9.45 [software/keyholder] - https://gerrit.wikimedia.org/r/458245 (owner: Faidon Liambotis)
[12:29:15] (CR) jerkins-bot: [V: -1] Abstract the SSH fingerprint generation [software/keyholder] - https://gerrit.wikimedia.org/r/458248 (owner: Faidon Liambotis)
[12:29:19] (CR) jerkins-bot: [V: -1] Parse/build agent request/responses once [software/keyholder] - https://gerrit.wikimedia.org/r/458243 (owner: Faidon Liambotis)
[12:29:24] (CR) jerkins-bot: [V: -1] Refactor handle() [software/keyholder] - https://gerrit.wikimedia.org/r/458244 (owner: Faidon Liambotis)
[12:29:37] (CR) jerkins-bot: [V: -1] Switch path handling to pathlib.Path [software/keyholder] - https://gerrit.wikimedia.org/r/458246 (owner: Faidon Liambotis)
[12:29:43] (CR) jerkins-bot: [V: -1] Implement SSH_AGENTC_LOCK/SSH_AGENTC_UNLOCK [software/keyholder] - https://gerrit.wikimedia.org/r/458242 (owner: Faidon Liambotis)
[12:29:46] (CR) jerkins-bot: [V: -1] Add permission checks for various commands [software/keyholder] - https://gerrit.wikimedia.org/r/458240 (owner: Faidon Liambotis)
[12:29:47] (CR) jerkins-bot: [V: -1] Use mlockall() to avoid any potential swapping [software/keyholder] - https://gerrit.wikimedia.org/r/458239 (owner: Faidon Liambotis)
[12:30:03] (CR) jerkins-bot: [V: -1] Implement all the SSH agent bits and stop proxying [software/keyholder] - https://gerrit.wikimedia.org/r/458236 (owner: Faidon Liambotis)
[12:30:05] (CR) jerkins-bot: [V: -1] Make pylint a little happier [software/keyholder] - https://gerrit.wikimedia.org/r/458238 (owner: Faidon Liambotis)
[12:30:18] (CR) jerkins-bot: [V: -1] Verify the validity of signature requests [software/keyholder] - https://gerrit.wikimedia.org/r/458241 (owner: Faidon Liambotis)
[12:30:26] (PS1) Urbanecm: Introduce engineer user group on Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460699 (https://phabricator.wikimedia.org/T203000)
[12:31:09] (CR) jerkins-bot: [V: -1] Split SshAgentCommand type to Request/Response [software/keyholder] - https://gerrit.wikimedia.org/r/458237 (owner: Faidon Liambotis)
[12:31:18] (Abandoned) Urbanecm: Introduce engineer user group on Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460699 (https://phabricator.wikimedia.org/T203000) (owner: Urbanecm)
[12:32:24] zeljkof, thcipriani, no_justification & other deployers: Anybody to deploy T204243 / https://gerrit.wikimedia.org/r/460432 ? Last minute throttle lift except, I cannot use regular deploy window.
[12:32:25] T204243: Throttle exemption for event in Ireland - https://phabricator.wikimedia.org/T204243
[12:33:27] marostegui, https://phabricator.wikimedia.org/T204408
[12:34:09] (PS1) Urbanecm: Add *.nasimonline.ir to wgCopyUploadsDomains whitelist for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/460700 (https://phabricator.wikimedia.org/T203371)
[12:39:19] (PS1) Urbanecm: Create eliminator group at Vietnamese Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207)
[12:51:20] (PS1) Elukey: profile::analytics::refinery::job::camus: remove check for webreq:misc [puppet] - https://gerrit.wikimedia.org/r/460702
[12:56:55] (PS2) Elukey: Remove Camus jobs and checks for cache::misc [puppet] - https://gerrit.wikimedia.org/r/460702
[12:59:14] (CR) Joal: [C: 1] "LGTM !" [puppet] - https://gerrit.wikimedia.org/r/460702 (owner: Elukey)
[12:59:49] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12469/analytics1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/460702 (owner: Elukey)
[13:15:38] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational
[13:32:28] hashar: CI is broken for keyholder, see above
[15:15:18] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:19:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:32:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:34:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:40:27] Urbanecm: I don't want to deploy on the weekend when no one is around, even for a low-risk deploy like a throttle rule. Looking at the patch, it looks like European mid-day SWAT will be too late. Is it possible to reach out to hashar or zeljkof early on Monday? Or maybe add a row to the top of the deployment schedule and ping them with details in IRC?
[15:54:18] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:56:29] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[16:52:29] thcipriani, can try to...
[17:06:50] Urbanecm: thcipriani I'll be around on Monday morning
[17:07:06] Good. Should I ping you by then to ensure processing?
[17:16:58] Urbanecm: yes please, remind me
[17:17:05] will do :)
[17:24:29] PROBLEM - puppet last run on cp4030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:49:49] RECOVERY - puppet last run on cp4030 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[20:12:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:15:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[20:24:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 59.58 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:26:08] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 70.41 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:05:59] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 51842 MB (10% inode=99%)
[21:13:39] RECOVERY - Disk space on elastic1023 is OK: DISK OK
[21:51:29] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Paladox)
[21:52:11] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Paladox) p:Triage>Unbreak! UBN as it is very slow accessing phabricator. To the point i have to wait a few secs each time i press a link.
[21:57:08] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Paladox) this bug has hit us again T204421
[22:13:09] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1078 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:38:32] Someone reported that previous versions of this file on Commons are reporting file not found: https://en.wikipedia.org/wiki/File:White_House_1846.jpg and I see the same problem.
[22:38:51] I'm thinking that this is the correct place to report this, yes?
[22:43:49] godog ^^
[22:49:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:51:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:58:19] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:06:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:31:59] PROBLEM - Apache HTTP on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:59] RECOVERY - Apache HTTP on mw2277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.093 second response time
[23:39:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:41:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen