[00:42:09] (CR) Alex Monk: "Ottomata: I've looked into it. The reason we can't use monitoring::graphite_threshold is that it relies upon monitoring::service, which re" [puppet] - https://gerrit.wikimedia.org/r/311747 (owner: Alex Monk)
[00:50:30] (CR) Alex Monk: "Actually, would it just silently not attempt to share the resource and be okay with that? If so we might be able to use it." [puppet] - https://gerrit.wikimedia.org/r/311747 (owner: Alex Monk)
[01:44:36] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:56:57] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:09:29] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:18:12] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:08] (PS1) Matanya: standard packages: re-add intel-microcode [puppet] - https://gerrit.wikimedia.org/r/312714
[02:21:49] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:33:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:38:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:43:10] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[03:07:20] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38999 MB (3% inode=99%)
[03:22:06] RECOVERY - Disk space on maps-test2001 is OK: DISK OK
[03:38:29] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:43:30] PROBLEM - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[03:44:12] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error
[03:47:33] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[03:52:34] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[04:00:43] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1]
[04:01:03] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[04:01:45] RECOVERY - Disk space on ms-be1002 is OK: DISK OK
[06:26:12] RECOVERY - Disk space on scb1002 is OK: DISK OK
[06:51:06] (CR) Hashar: [C: -1] "So I knew that something was eventually going to wrong, and on benefit of doubt CR+1 anyway." [puppet] - https://gerrit.wikimedia.org/r/312705 (owner: EBernhardson)
[06:57:18] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.017 second response time
[07:07:48] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:11:51] (CR) Jcrespo: [C: 1] "That was my fault when merging 3 different patches into 1. Maybe this can be cherry-picked in beta, not sure we can merge it these week in" [puppet] - https://gerrit.wikimedia.org/r/312652 (owner: Hashar)
[07:20:07] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.062 second response time
[07:33:14] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[08:42:18] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:15] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[10:38:13] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[11:00:10] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1202
[11:05:13] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 858982 Threads: 1 Questions: 187272113 Slow queries: 6682 Opens: 6879 Flush tables: 1 Open tables: 571 Queries per second avg: 218.016 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:21:02] (PS1) Urbanecm: Change protection level autoreview in arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312734 (https://phabricator.wikimedia.org/T146575)
[11:22:00] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665350 (abian)
[11:26:27] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665357 (abian)
[11:31:34] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665360 (abian)
[11:51:33] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665376 (Esc3300)
[11:52:46] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Esc3300) Maybe related: T146529
[12:07:49] (CR) Luke081515: [C: 1] Change protection level autoreview in arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312734 (https://phabricator.wikimedia.org/T146575) (owner: Urbanecm)
[12:31:21] (PS1) Dereckson: Disable Upload Wizard blacklist issues on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417)
[13:14:48] mhhh scb spammed with celery logs 🤔
[13:25:37] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:45:21] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665415 (Sjoerddebruin) p:Triage>High Lots of reports coming in.
[13:48:34] looks like it ran out of disk space for a couple of hours until logrotate kicked in, I'm going to stop puppet and redirect celery logs to /srv/log/celery /cc akosiaris halfak Amir1
[13:49:12] godog: Thanks. I'll take care of it
[13:49:48] We probably need to shut up celery
[13:50:27] let me check the puppet config
[13:50:39] Amir1: yeah it is logging like crazy, and two almost identical lines
[13:50:56] It's probably in debug mode
[13:51:01] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[13:52:08] celery? all lines have INFO which I'm assuming is the level
[13:52:20] The log level is ERROR https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/worker.pp#L2
[13:52:26] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:52:26] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:52:57] it's probably overwritten somewhere in hiera, or the web daemon is spitting out a lot
[13:53:15] let me check logs directly
[13:54:36] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:55:32] looks like the loglevel is forced INFO in ores_celery.py
[13:55:32] application.worker_main(argv=["celery_worker", "--loglevel=INFO"])
[13:57:14] godog: it shouldn't happen because that part only applies when being run directly, not as a daemon
[13:57:42] godog: I can't check logs, it says needs root and I don't have
[13:58:15] Can you send me several lines in private? (you can delete IP, etc. I signed the NDA though)
[13:59:24] sure
[14:00:33] but yeah you are right that shouldn't apply
[14:18:52] Amir1: ok I'll let you debug why loglevel isn't ERROR as it should be but INFO instead, though I'm going ahead and moving the logging to /srv where there's space, or scb will run out of space
[14:22:40] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665507 (Esc3300) Which ones show it? - https://grafana.wikimedia.org/dashboard/db/wikidata-query-servi...
[14:24:16] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (Multichill) I guess Varnish 5xx rate at https://grafana.wikimedia.org/dashboard/db/wikidata-query...
[14:26:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:27:03] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665511 (Esc3300) Yeah, looks like it: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?p
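A minimal sketch of the celery log-level question discussed above (13:52–13:57): only the worker_main() call is quoted from ores_celery.py; the module layout, the app name, and the config-driven setLevel() call are assumptions for illustration. It shows why a --loglevel forced in the __main__ path should only take effect when the file is run directly, not under the daemon.

    # Illustrative sketch; only the worker_main() line is quoted from the log above.
    import logging
    from celery import Celery

    application = Celery("ores_celery")  # hypothetical app name

    # In production the level is meant to come from configuration
    # (ERROR, per modules/ores/manifests/worker.pp); hard-coded here as a stand-in.
    logging.getLogger().setLevel(logging.ERROR)

    if __name__ == "__main__":
        # Reached only when running `python ores_celery.py` directly.
        # A daemonized worker started by the service manager never executes this,
        # which is why the forced INFO level here was not expected to matter.
        application.worker_main(argv=["celery_worker", "--loglevel=INFO"])

If the daemon were in fact launched through this __main__ path, the --loglevel=INFO argument would override the configured ERROR level, which would match the INFO-only lines observed on scb100*.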
[14:28:56] !log temporarily redirect ores-celery-worker logs to /srv/log/celery/syslog.log and remove old daemon.log.1 to avoid scb100* filling up the / filesystem
[14:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:29:24] RECOVERY - Disk space on scb1001 is OK: DISK OK
[14:30:27] !log putting wdqs1002 in maintenance mode, server looks unstable, investigating... - T146576
[14:30:28] T146576: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576
[14:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:31:24] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665514 (Esc3300)
[14:34:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:34:30] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (jcrespo) According to icinga/grafana, several servers have lag (including that one).
[14:37:58] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.007 second response time
[14:38:30] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.007 second response time
[14:39:15] ^ I'm on the WDQS issue...
[14:39:29] !log restarting blazegraph on wdqs1002 - T146576
[14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:39:34] T146576: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576
[14:40:00] Operations, ORES, User-Ladsgroup: ores celery worker spamming scb logs - https://phabricator.wikimedia.org/T146583#2665522 (fgiunchedi)
[14:40:02] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665536 (Gehel) logs on wdqs1002 show error opening sockets: ``` Sep 25 14:35:08 wdqs1002 bash[10426]: 20...
[14:48:26] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665333 (EricJ) I was getting the same issue intermittently when hitting the sparql endpoint programmatica...
[14:53:20] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9779 bytes in 0.007 second response time
[14:53:51] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9779 bytes in 0.006 second response time
[14:54:29] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665541 (Gehel) Restarting blazegraph on wdqs1002 seems to have solved the issue. I'll dig into the logs t
[15:03:30] Operations, Discovery, Wikidata, Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2665542 (Gehel) It seems that wdqs1002 started showing replication lag issues around 16:00 UTC on Saturday...
[15:04:45] godog: I was afk, I hope it doesn't happen again
[15:04:55] I'll debug and fix it
[15:05:10] probably a deployment
[15:06:06] Operations, ORES, User-Ladsgroup: ores celery worker spamming scb logs - https://phabricator.wikimedia.org/T146583#2665546 (Ladsgroup)
[15:20:14] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient
[15:22:23] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:22:37] RECOVERY - Disk space on stat1002 is OK: DISK OK
[15:24:01] !log executed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_.2Fmnt.2Fhdfs on stat1002 (fusermount didn't succeed to umount though)
[15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:24:21] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:26:56] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[15:27:04] thanks stat1002
[15:28:00] Is someone from Ops around just to sit while I deploy for https://phabricator.wikimedia.org/T146581 ?
[15:28:27] Caused downtime today
[15:34:21] _joe_: around for ^
[15:34:24] ?
[16:03:53] I'm still waiting for someone :D
[16:04:13] Amir1: ops is out all week
[16:04:33] they are offsite this week
[16:04:36] obviously, but there should probably be someone around in case of emergency
[16:05:07] Oh
[16:05:10] (+ it's the weekend)
[16:06:35] yep
[16:09:14] Amir1: i'd try pinging the person on clinic duty (see channel topic)
[16:09:52] MatmaRex: I tried but his is not in the channel (probably due to time zone differences)
[16:10:15] s/his/he
[16:10:21] uh, he's not online at all. whoops…
[16:10:43] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:12:00] the funny thing is I can do it, but someone from ops should be around in case something goes really really wrong :D
[16:12:47] (Also deployed in labs, works just fine)
[16:13:11] yeah, the clinic duty person is not online 24/7
[16:13:37] Amir1, what sort of change to the currently-deployed version is involved?
[16:14:06] Krenair: I want to only deploy this: https://gerrit.wikimedia.org/r/#/c/312740/
[16:14:58] I agree, but what if a DDoS attacker chooses this time?
[16:15:36] if your concern was security-related maybe we shouldn't be having this discussion here
[16:15:46] maybe I'm too pessimistic
[16:16:05] it looks simple, you've tested it in labs
[16:18:06] how many nodes run this service?
[16:18:51] two in eqiad, scb100[12] (+ two in codfw)
[16:19:04] Okay, I'll deploy it
[16:19:06] are you capable of depooling each and applying the change there, then repooling?
[16:19:16] We have canary nodes
[16:19:36] I think scb1002 is the canary one
[16:25:58] !log deploying 9cc2009 on ORES nodes (T146581)
[16:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:26:13] T146581: celery log level is INFO causing disruption on ORES service - https://phabricator.wikimedia.org/T146581
[16:28:29] deployed in the canary node (scb2001)
[16:30:46] deploying to all nodes
[16:35:42] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[16:43:03] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:45:05] Okay, everything is fine for now
[16:50:58] Time to enjoy what's left of the weekend
[16:51:00] o/
[17:08:18] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:24:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:27:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:32:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[17:40:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:42:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[17:56:18] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:21:13] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:31:32] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:40:16] (PS1) Hashar: puppetmaster: polish git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/312747
[18:40:18] (PS1) Hashar: puppetmaster: git-sync-upstream early abort [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946)
[18:42:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4043897 keys - replication_delay is 641
[18:47:16] (PS2) Hashar: puppetmaster: git-sync-upstream early abort [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946)
[18:51:25] (CR) Hashar: "cherry picked on beta puppet master" [puppet] - https://gerrit.wikimedia.org/r/312747 (owner: Hashar)
[18:51:38] (CR) Hashar: "cherry picked on beta puppet master" [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: Hashar)
[18:56:30] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:10:00] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4008356 keys - replication_delay is 0
[20:29:57] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:43:07] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:55:08] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:08:15] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[21:19:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[21:22:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:12:49] (CR) Alex Monk: ""always cause rebase to rebase"? always causes git to rebase?" [puppet] - https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: Hashar)
[22:24:07] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]