[00:43:07] o/ hey all, the Search API that's used by apps seems to be returning 403's...
[00:43:17] such as:
[00:43:19] https://en.wikipedia.org/w/api.php?action=query&format=json&maxlag=6&requestid=&errorformat=plaintext&prop=description%7Cpageimages%7Cinfo&list=search&generator=prefixsearch&redirects=1&converttitles=1&formatversion=2&piprop=thumbnail&pithumbsize=320&pilicense=any&inprop=varianttitles&srsearch=a&srnamespace=0&srlimit=1&sroffset=0&srwhat=text&srinfo=suggestion&srprop=&gpssearch=a&gpsnamespace=0&gpslimit=20
[00:44:39] dbrant: https://phabricator.wikimedia.org/T241421
[00:46:12] Reedy: ah, thanks; just checking whether this is known.
[00:46:31] It's seemingly from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/561300/6/modules/varnish/templates/text-frontend.inc.vcl.erb,unified
[00:46:45] I don't know why it's apparently blocking all IPs though
[00:47:27] Reedy it's not blocking the ip
[00:47:34] if (req.http.host == "en.wikipedia.org" && req.url ~ "/w/api.php\?.*srsearch=.*") {
[00:48:01] I can read
[00:48:14] Then error message is vague and is being shown to everyone
[00:48:30] The commit message suggests it should be only blocking one UA
[00:48:35] But it's seemingly blocking *all* requests
[00:49:17] ok, sorry :)
[00:54:01] (PS1) CDanis: Revert "block search API traffic from one U-A running on AWS" [puppet] - https://gerrit.wikimedia.org/r/561321
[00:55:29] dbrant: ^
[00:55:32] (CR) CDanis: [C: +2] Revert "block search API traffic from one U-A running on AWS" [puppet] - https://gerrit.wikimedia.org/r/561321 (owner: CDanis)
[00:56:36] yeah, really sorry about that :(
[00:58:43] aha, that's alright, no worries!
[01:02:24] and... i'm able to search again! thx
[01:03:10] I'll write up an incident doc tomorrow
[01:03:32] final cp-text nodes are running puppet now
[01:05:42] 100% now. sorry again, happy new year
[01:06:52] (PS1) CDanis: *properly* block search traffic of just one UA from AWS [puppet] - https://gerrit.wikimedia.org/r/561322
[01:08:38] (PS2) CDanis: *properly* block search traffic of just one UA from AWS [puppet] - https://gerrit.wikimedia.org/r/561322 (https://phabricator.wikimedia.org/T241421)
[03:37:21] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808) > [] define labstore::nfs_mount: diamond::collector { 'Nfsiostat' } > > This is a custom collector deployed via Puppet (modules/d...
[04:57:22] !log depooling labweb1002 so I can hotfix labweb1001 for T240734
[04:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:03] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:49:04] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Picture not found on https://commons.wikimedia.org:443/wiki/Main_Page - 99118 bytes in 0.010 second response time https://phabricator.wikimedia.org/project/view/1118/
[08:05:08] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 171405 bytes in 0.281 second response time https://phabricator.wikimedia.org/project/view/1118/
[09:11:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:13:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:55:03] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[16:12:37] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 147935272 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:14:23] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 186656 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:18:42] (PS1) BryanDavis: wiki replicas: Remove outdated comment about spamblacklist [puppet] - https://gerrit.wikimedia.org/r/561352 (https://phabricator.wikimedia.org/T241668)
[17:47:52] (CR) DannyS712: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/561352 (https://phabricator.wikimedia.org/T241668) (owner: BryanDavis)
[18:09:55] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 236801256 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:11:43] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 66448 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:14:38] Operations, cloud-services-team: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (bd808) Open→Declined >>! In T151304#5555145, @MoritzMuehlenhoff wrote: > See earlier discussion on task, this is still used by Toolforge, so WMCS SREs might still want to tweak the log...
[18:14:42] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (bd808)
[18:26:47] (PS4) BryanDavis: cloud: update maintain-views to handle dblists with comments [puppet] - https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415)
[18:53:59] (PS1) MarcoAurelio: Modify &wgArticleCount to 'any' for ta.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/561355 (https://phabricator.wikimedia.org/T241684)
[18:55:06] (PS1) EBernhardson: Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520)
[18:55:45] (CR) jerkins-bot: [V: -1] Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520) (owner: EBernhardson)
[18:56:20] (PS2) MarcoAurelio: Modify $wgArticleCount to 'any' for ta.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/561355 (https://phabricator.wikimedia.org/T241684)
[19:07:15] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36729792 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:09:03] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 41664 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:11:08] (PS1) Urbanecm: Set Author and Author_talk aliases for Autore NS at napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/561359 (https://phabricator.wikimedia.org/T231880)
[19:14:06] (PS2) EBernhardson: Perform weekly dumps of all public media urls [puppet] - https://gerrit.wikimedia.org/r/561356 (https://phabricator.wikimedia.org/T240520)
[19:15:42] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/561355 (https://phabricator.wikimedia.org/T241684) (owner: MarcoAurelio)
[20:18:57] (PS1) MarcoAurelio: Modify ge.wikimedia project logos [mediawiki-config] - https://gerrit.wikimedia.org/r/561365 (https://phabricator.wikimedia.org/T241327)
[20:33:17] (PS2) MarcoAurelio: Modify ge.wikimedia project logos [mediawiki-config] - https://gerrit.wikimedia.org/r/561365 (https://phabricator.wikimedia.org/T241327)
[21:45:59] Operations, Toolforge, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Tool Labs / Tool Forge - https://phabricator.wikimedia.org/T210991 (bd808) I am going to merge this into {T210993} and copy the data from the task description over there. Havin...
[21:46:12] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[21:46:14] Operations, Toolforge, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Tool Labs / Tool Forge - https://phabricator.wikimedia.org/T210991 (bd808)
[21:46:47] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[21:50:47] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[21:57:00] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[22:19:02] (PS1) BryanDavis: toolforge: replace diamond redis monitoring with prometheus [puppet] - https://gerrit.wikimedia.org/r/561379 (https://phabricator.wikimedia.org/T210993)
[23:38:51] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state