[00:01:24] (03CR) 10Dzahn: "https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#sendemail.connectTimeout" [puppet] - 10https://gerrit.wikimedia.org/r/507853 (owner: 10Paladox)
[00:06:32] !log restarting gerrit to pick up config changes for 2 mail threads and lower timeout (gerrit:507852, gerrit: 507853)
[00:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:42] paladox: had a little scare because i saw errors but unrelated
[00:08:48] oh
[00:08:59] the external id exception thing
[00:09:04] ah
[00:09:19] you mean with a similar error to https://phabricator.wikimedia.org/T222336 ?
[00:09:21] ConfigInvalidException: Invalid external ID config for note
[00:09:32] ConfigInvalid sounded bad ..after changing config.. at first
[00:09:36] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/507853 (owner: 10Paladox)
[00:09:39] but that's different
[00:10:05] !log remove static route to 208.80.155.128/25 on cr1/2-eqiad - T193496
[00:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:09] T193496: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496
[00:10:17] ah ok
[00:11:02] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[00:11:18] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[00:11:34] ah
[00:11:36] looking
[00:11:52] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[00:13:01] too
[00:13:10] i thought it wasn't even used yet?
[00:13:20] the private key that is
[00:13:41] oh..nevermind :) this is of course just from the gerrit restart
[00:13:50] you can safely ignore or just run puppet once
[00:14:01] ok!
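For context, the failed Exec[git_pull_*] resources above clear on the next Puppet run, per the "just run puppet once" advice; a one-off manual run on an affected host looks like this (standard Puppet agent CLI, nothing WMF-specific assumed):

    sudo puppet agent --test    # single foreground run; re-syncs the failed Exec[git_pull_*] resources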
[00:15:32] PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[00:16:22] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:17:06] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/16301/netmon1002.wikimedia.org/change.netmon1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:22:39] (03PS4) 10Ayounsi: LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706)
[00:27:33] (03PS1) 10Ayounsi: LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906
[00:27:54] (03PS2) 10Ayounsi: LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906
[00:28:16] (03CR) 10Dzahn: [C: 03+2] LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906 (owner: 10Ayounsi)
[00:29:12] (03CR) 10Dzahn: [V: 03+2 C: 03+2] LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906 (owner: 10Ayounsi)
[00:29:43] (03PS5) 10Dzahn: LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:29:54] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:30:28] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100%
[00:31:52] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9443): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:32:03] !log powercycling elastic2038
[00:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:02] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, number_of_nodes: 14, active_shards: 3158, unassigned_shards: 225, active_primary_shards: 1128, initializing_shards: 0, delayed_unassigned_shards: 225, status: yellow, number_of_data_nodes: 14, number_of_pending_tasks: 0, relocating_shards: 0
[00:33:02] ight_fetch: 0, active_shards_percent_as_number: 93.34909843334319, task_max_waiting_in_queue_millis: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
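The health check above simply polls the Elasticsearch cluster health endpoint; a manual spot-check can be done the same way (a sketch, assuming access to the internal service endpoint and jq installed):

    curl -s 'https://search.svc.codfw.wmnet:9443/_cluster/health' | jq '.status, .unassigned_shards'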
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:34:33] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16303/" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:34:38] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [00:34:47] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:37:50] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:38:46] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Dzahn) @Gehel It crashed today. Just went down and had no console output. Then i powercycled it.. Then it was back. Note the service came back before the host was back..... [00:42:04] RECOVERY - puppet last run on schema2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:43:42] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [00:44:47] 10Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (10Dzahn) p:05Triage→03Normal Thanks! That stopped the spam, so priority isn't that high now. [00:45:01] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi) 05Open→03Resolved a:03ayounsi Everything is done. Future upgrades should be much smoother. [00:47:40] 04Critical Testing transport from LibreNMS [00:51:32] XioNoX: ^ this is a good thing?:) [00:51:47] yeah, I pushed the "test" button [00:51:51] :) [00:52:06] nice that the upgrade is done now :) [00:53:32] 10Operations, 10Cloud-Services, 10netops, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10ayounsi) 05Open→03Resolved 208.80.155.128/25 has been cleaned up from Netbox and DNS (see above). [01:12:31] 08̶W̶a̶r̶n̶i̶n̶g Device cr3-ulsfo.wikimedia.org recovered from Juniper environment status [01:12:37] 08̶W̶a̶r̶n̶i̶n̶g Device cr4-ulsfo.wikimedia.org recovered from Juniper environment status [01:12:47] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqord.wikimedia.org recovered from Juniper environment status [01:15:21] god that yellow is obnoxious [01:16:42] should be green :) [01:16:58] it's not a warning. it's "no more warning" [01:17:07] More it's not readable on the grey background [01:17:22] well, that's black for me [01:17:33] https://imgur.com/a/cNyxlu6 [01:17:39] It's a little better zoomed in... [01:21:50] looks nice on black background :) [01:22:14] needs skin support and per-user settings :p [01:22:34] (03PS1) 10Alex Monk: network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) [01:23:07] 08,01 ̶W̶a̶r̶n̶i̶n̶g [01:23:12] Reedy: Looks fine to me. 
[01:23:12] Reedy: Looks fine to me. ;-) https://usercontent.irccloud-cdn.com/file/468mrdhG/image.png
[01:24:46] I think I have the worst
[01:25:00] https://i.imgur.com/kz2Q4rM.png
[01:25:51] yours looks much better Krenair
[01:26:10] Can at least read yours
[01:26:44] mine looks like: https://phabricator.wikimedia.org/F28909362
[01:26:59] about as bad as mine :P
[01:27:06] lol
[01:27:40] yeah true my shade of yellow is better, but mine can't even handle the first character
[01:29:45] software sucks
[01:31:32] Character encoding and rendering sucks.
[01:31:40] If only we'd invented computers before writing. ;-)
[01:31:59] heh
[01:35:53] Reedy: https://gist.github.com/TehNut/8f58c78170be56abd8b4
[01:36:21] DarkenQuassel?:)
[01:36:50] https://i.imgur.com/amMKaJz.png
[02:21:42] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Tgr) Not sure what counts as consensus for something like this, but +1 from me: 100% of the traffic is CC-d from the wikitech or ops list; it's too underspecified to be useful to the peopl...
[03:13:50] PROBLEM - puppet last run on analytics1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:25:08] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:40:26] RECOVERY - puppet last run on analytics1058 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:45:48] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:51:46] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:12:22] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:25:59] (03PS1) 10Ayounsi: LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706)
[04:27:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:28:29] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16304/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[04:34:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:40:23] weird, I was just getting served pages without styling
[04:46:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:54:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:55:09] <_joe_> !log progressively depooling parsoid servers in codfw to assess load tolerance
[04:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:19] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp20(1[7-9]|20).*
[04:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:36] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:03:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:03:42] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:03:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:52] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:05:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:05:02] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:05:04] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:05:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:05:18] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:05:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp201[5-6].*
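The depools logged above go through conftool; a minimal sketch of the confctl invocation behind such a logged action (selector copied from the log line, sudo and exact CLI form assumed from conftool's documented usage, so treat it as illustrative):

    sudo confctl select 'cluster=parsoid,dc=codfw,name=wtp201[5-6].*' set/pooled=no
    # and the corresponding repool once done:
    sudo confctl select 'cluster=parsoid,dc=codfw,name=wtp201[5-6].*' set/pooled=yes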
[05:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:22] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:06:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:06:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:06:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:07:16] FYI - getting back 503 errors when loading user pages.
[05:08:59] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2844 bytes in 1.420 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:09:08] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1527 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[05:09:46] Meta is giving me 503s ... probably what icinga-wm is going on about
[05:09:55] twinkleoptions.js (probably others) not loading either
[05:10:08] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:10:28] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2403 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[05:11:25] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2775 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:11:28] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:11:38] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2818 bytes in 1.658 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:04] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2825 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:06] RECOVERY - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16151 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:31] Packet loss and other errors are decreasing
[05:14:52] * Oshwah is watching the graphs
[05:15:06] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:15:10] JJMC89: Try now
[05:15:20] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2817 bytes in 0.461 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:15:31] Might still get errors, but they seem to be recovering...
[05:16:18] Oshwah: It's getting better. Still some 503s when loading some js.
[05:16:19] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16137 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:16:27] JJMC89: Same with me
[05:16:28] the stylesheets aren't loading for me at https://en.wikipedia.org/wiki/User:MusikAnimal
[05:16:42] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16189 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:17:03] My scripts aren't loading or pulling any data (my live scripts are getting errors with the API)
[05:17:06] I was seeing some problems with stylesheets earlier
[05:17:10] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01207 https://grafana.wikimedia.org/dashboard/db/logstash
[05:17:12] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04798 https://grafana.wikimedia.org/dashboard/db/logstash
[05:17:13] but it appears okay now?
[05:17:28] Krenair: Page loading on https looks better
[05:17:34] Yea, styles weren't loading a little bit ago for me too.
[05:17:47] .js / .css still having some issues
[05:18:21] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16202 bytes in 1.300 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:19:17] strangely, the CSS loads in incognito or another browser, but clearing the cache in my main browser window doesn't work
[05:19:26] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16199 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:19:45] <_joe_> uhm
[05:19:53] musikanimal: Interesting
[05:19:56] <_joe_> can I ask you in which continent are you located?
[05:20:08] Everything now looks A-OK; keeping watch...
[05:20:19] proudly and openly a New Yorker
[05:20:53] I'm in the U.S.
[05:20:55] aha, I think it was the DNS cache. Clearing that worked
[05:21:00] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16190 bytes in 1.264 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:21:11] or it was just coincidence
[05:21:26] not sure there's been any big DNS changes overnight...
[05:22:08] nope, nothing in dns.git
[05:22:10] I saw problems earlier and I'm in Europe
[05:22:33] coincidence then. Everything looks good now
[05:22:53] so only user pages were affected here? That was my observation
[05:23:12] <_joe_> musikanimal: indeed, coincidence
[05:23:36] musikanimal: I was getting issues all over meta and enwiki.
[05:25:00] musikanimal, I encountered it browsing enwiki mainspace
[05:26:16] I had it just for userpages, and only certain userpages. I kept hitting "Random article" and all of those loaded fine
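Flushing the local DNS cache, as musikanimal tried above, varies by client; two common variants (illustrative only, and unrelated to any server-side fix):

    sudo systemd-resolve --flush-caches                            # Linux with systemd-resolved
    sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder  # macOS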
[05:26:25] <_joe_> the problem was one varnish server with issues
[05:26:29] <_joe_> more or less known
[05:26:34] <_joe_> in our main active dc
[05:26:38] <_joe_> https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now
[05:26:42] <_joe_> see cp1077
[05:27:30] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:38] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:54] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:28:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:10] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:18] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:26] I am also running into issues. My UTC clock disappeared on enwiki
[05:30:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:34] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:31:20] _joe_, looks like maybe the problems aren't over?
[05:31:31] <_joe_> sigh
[05:31:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:31:36] <_joe_> indeed
[05:31:46] maybe time to depool that varnish?
[05:32:09] <_joe_> Krenair: no it needs a restart
[05:32:33] my UTC clock is now back, for now at least
[05:32:42] _joe_: I'm around, can I help with anything?
[05:32:47] <_joe_> !log restarting varnish backend on cp1077
[05:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:05] <_joe_> XioNoX: this ^^ should solve things in theory
[05:33:28] ok, don't hesitate to ping me if needed
[05:33:58] I'm still seeing problems
[05:34:04] <_joe_> I'm not sure that's the only issue btw
[05:34:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:20] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:34:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
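"Restarting varnish backend on cp1077" means restarting the backend Varnish instance on that cache host; later in this log the same step appears wrapped by a helper ("cp1089: varnish-backend-restart") that also handles depooling. The underlying action is roughly this (unit name and wrapper behavior assumed, so treat it as a sketch rather than the exact WMF procedure):

    sudo systemctl restart varnish.service    # backend varnishd instance; the wrapper depools/repools around this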
[05:34:36] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:34:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:47] "Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes."
[05:34:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:58] Still getting 503 error returns when looking at page history
[05:34:58] when I tried loading enwiki spi @_joe_
[05:35:05] I'm refreshing https://en.wikipedia.org/wiki/Emily_Watson and getting 503s
[05:35:14] (while logged in)
[05:35:24] 503 logged in as well
[05:35:26] intermittently
[05:35:29] better way to describe my error
[05:35:30] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:35:38] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:43] <_joe_> Krenair: I can load that just fine
[05:35:44] same as Krenair
[05:35:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:35:52] cp1085
[05:35:56] <_joe_> but can you post me your X-cache header?
[05:35:56] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:58] <_joe_> ok
[05:36:00] <_joe_> as I thought
[05:36:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:12] <_joe_> I can't really restart that as well now
[05:36:22] my x-cache header... x-cache
[05:36:22] cp1085 int, cp3032 miss, cp3030 pass
[05:36:44] Graphs are starting to curve back into the direction of when the original issue was present just earlier...
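The X-cache header _joe_ asked for lists the cache hosts that handled the request and what each did; "cp1085 int, cp3032 miss, cp3030 pass" above points at cp1085 generating an internal response. It is easy to capture with a plain HEAD request (a sketch; header name as shown in the paste above):

    curl -sI 'https://en.wikipedia.org/wiki/Emily_Watson' | grep -i '^x-cache'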
[05:36:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:56] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:37:02] But not all of them (yet?)
[05:38:03] * Oshwah is keeping watch
[05:38:14] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:39:45] it seems to be working okay for me now
[05:40:06] ^
[05:40:34] Graphs are recovering. A much smaller blip than the earlier 503 errors...
[05:40:52] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:13] <_joe_> we are on it, I also called other people to help
[05:41:18] <_joe_> we will keep an eye
[05:41:26] <_joe_> I don't think we're out of the woods, still
[05:41:54] * Oshwah panics and starts screaming
[05:41:55] kk np
[05:42:04] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:10] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:12] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:28] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:06] Oshwah: https://www.youtube.com/watch?v=VnT7pT6zCcA
[05:43:26] <_joe_> Izhidez: oh that's something I lived better not seeing :DD
[05:43:35] lol
[05:43:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:44] lolol
[05:43:44] Oh, beeker...
[05:43:54] Beaker*
[05:46:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:47:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:47:26] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:47:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:47:54] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:49:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:51:05] * TheSandDoctor has never seen a wild panicking @Oshwah before :D :P
[05:55:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:56:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:57:12] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:58:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:03:46] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[06:05:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:20:58] (03CR) 10Smalyshev: wdqs: add WDQS restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[06:28:54] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:25] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) Thanks Chris! Once @robh has added the production DNS entries I can take over and install them myself :-)
[07:06:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:06:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:07:46] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:07:46] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[07:07:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:08:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:08:10] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:10:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:11:42] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:11:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:11:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:11:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:12:06] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:12:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:12:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:16:28] !log cp1089: varnish-backend-restart
[07:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:54] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[07:17:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:17:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:18:05] 10Operations, 10netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (10elukey)
[07:21:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:22:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:39:53] (03Abandoned) 10Vgutierrez: authdns: Avoid caching dns-01 challenges [puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez)
[07:43:40] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez)
[07:43:57] (03PS1) 10Ema: cache: reimage cp4024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507915 (https://phabricator.wikimedia.org/T219967)
[07:45:52] !log depool cp4024 and reimage as upload_ats T219967
[07:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:57] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[07:47:07] (03CR) 10Ema: [C: 03+2] cache: reimage cp4024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507915 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[07:50:34] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:50:47] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4024.ulsfo.wmnet'] ` The log can be...
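The "Check systemd state ... degraded" alerts here (ms-be2013, and netmon2001 earlier) fire whenever any unit on the host is in a failed state; identifying the culprit is plain systemd (no WMF-specific tooling assumed):

    systemctl is-system-running   # prints "degraded" if any unit has failed
    systemctl --failed            # lists the failed unit(s)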
[07:50:56] (03CR) 10Vgutierrez: "basically a NOOP for ATS instances with plain text inbound: https://puppet-compiler.wmflabs.org/compiler1002/16305/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[07:55:48] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[07:59:15] (03CR) 10Vgutierrez: "cosmetic changes in records.config for existing instances with caching enabled caused by regrouping of cache-related settings in the templ" [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[08:00:40] (03CR) 10Vgutierrez: [C: 03+2] config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:00:59] (03CR) 10Vgutierrez: [C: 03+2] dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:01:04] (03CR) 10Vgutierrez: [C: 03+2] CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:01:07] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:01:11] (03CR) 10Vgutierrez: [C: 03+2] Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:02:12] (03Merged) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:02:33] (03Merged) 10jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:03:45] (03CR) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:03:47] (03Merged) 10jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:03:49] (03Merged) 10jenkins-bot: acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:03:51] (03Merged) 10jenkins-bot: Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:04:07] (03CR) 10jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:06:08] (03CR) 10jenkins-bot: acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:06:10] (03CR) 10jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:06:13] (03CR) 10jenkins-bot: Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:06:59] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez)
[08:09:04] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[08:13:02] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:27:24] !log starting table recompression on new backup source hosts on eqiad and codfw (stop replication) T220572
[08:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:29] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[08:33:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful.
[08:38:57] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[08:44:04] (03Abandoned) 10Awight: [DNM] Enable Extension:JADE in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440124 (https://phabricator.wikimedia.org/T183381) (owner: 10Awight)
[08:47:40] !log pool cp4024 w/ ATS backend T219967
[08:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:44] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[08:49:37] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572)
[08:58:23] RECOVERY - EDAC syslog messages on db1107 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1107&var-datasource=eqiad+prometheus/ops
[09:03:41] (03PS17) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:05:25] (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:12:20] (03CR) 10ArielGlenn: [C: 03+2] Update DCAT-AP link [dumps/dcat] - 10https://gerrit.wikimedia.org/r/497435 (owner: 10Awight)
[09:17:48] (03PS18) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:20:59] !log ban elastic2038 from elastic clusters pending memory issue investigation - T217398
[09:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:03] T217398: elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398
[09:24:46] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Gehel) It looks like we need to investigate this a bit more >>! In T217398#4999270, @Papaul wrote: > The next step will be to monitor the system and see if we have the s...
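Banning a node as logged above is conventionally done by telling Elasticsearch not to allocate shards to it, so data drains off while the node stays reachable; a sketch using the stock cluster-settings API (endpoint reused from the earlier health check; the exact WMF tooling may wrap this differently):

    curl -s -XPUT 'https://search.svc.codfw.wmnet:9443/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.exclude._name": "elastic2038*"}}'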
[09:25:04] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Gehel) 05Resolved→03Open
[09:26:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 26.77 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:27:08] 10Operations, 10Discovery-Search (Current work): Decrease shard alert threshold for omega and psi elasticsearch clusters - https://phabricator.wikimedia.org/T222432 (10Gehel)
[09:30:29] (03PS19) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:30:33] (03PS1) 10Ammarpad: Add localized project logo for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065)
[09:31:07] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/16308/" [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:35:04] (03PS1) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932
[09:35:36] 10Operations, 10Discovery-Search (Current work): Decrease shard alert threshold for omega and psi elasticsearch clusters - https://phabricator.wikimedia.org/T222432 (10Gehel) 05Open→03Invalid Actually, the check timed out. Which make sense if it was routed to the problematic server, before it was marked in...
[09:35:55] (03CR) 10jerkins-bot: [V: 04-1] Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (owner: 10星耀晨曦)
[09:37:06] the alert above is due to an incoming traffic spike we got in eqsin-text: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1556869777118&to=1556876197358&var-site=eqsin&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[09:39:23] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 99.46 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:42:47] (03PS2) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933)
[09:43:14] (03PS3) 10Jbond: prometheus: add timeout paramter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:43:48] (03PS4) 10Jbond: prometheus: add timeout paramter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:44:25] (03CR) 10Jbond: "> Patch Set 1:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond)
[09:45:00] (03PS5) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:45:49] (03PS1) 10Ema: cache: reimage cp4025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507935 (https://phabricator.wikimedia.org/T219967)
[09:49:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp201[3-4].*
[09:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:27] !log depool cp4025 and reimage as upload_ats T219967
[09:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [09:49:55] (03CR) 10jerkins-bot: [V: 04-1] Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [09:51:06] (03CR) 10Ema: [C: 03+2] cache: reimage cp4025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507935 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:52:12] heads-up: I’m about to deploy a backport for T222347 (UBN!, as far as I know those are okay to fix on Fridays) [09:52:13] T222347: wbsearchentities now returns an error with type=lexeme - https://phabricator.wikimedia.org/T222347 [09:55:16] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4025.ulsfo.wmnet'] ` The log can be... [09:57:31] backport is on mwdebug1002, currently testing [10:00:19] uploading files on Commons still works [10:00:22] and so does editing captions [10:00:35] so I assume the gate-and-submit failure on that backport did not indicate a real problem [10:00:36] syncing [10:01:36] great :) [10:02:20] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/WikibaseLexemeCirrusSearch/: [[gerrit:507847|Fix reference to classes that moved (T222347)]] (duration: 00m 55s) [10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] T222347: wbsearchentities now returns an error with type=lexeme - https://phabricator.wikimedia.org/T222347 [10:03:27] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [10:05:19] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:07:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) ` => ctrl slot=0 pd all show status physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 146 GB): OK physicaldrive 1I:1:2 (port 1I:box 1:bay 2,... [10:10:01] (03PS2) 10Giuseppe Lavagetto: Allow proxyfetch to check more than one url at a time [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 [10:10:48] (03CR) 10Giuseppe Lavagetto: Allow proxyfetch to check more than one url at a time (035 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [10:11:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:13:31] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:13:52] checking mw1275 [10:17:33] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:17:44] (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [10:19:51] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2086526 https://wikitech.wikimedia.org/wiki/Varnish [10:25:30] !log Depool mw1275 [10:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:33:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:33:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:34:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:34:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:35:21] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:37:39] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful. [10:40:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... 
[10:41:28] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [10:42:14] !log T220380 upload zuul_2.5.1-wmf7 to jessie-wikimedia [10:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:19] T220380: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 [10:43:28] (03PS7) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [10:43:30] (03PS7) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [10:43:49] !log T220380 remove zuul_2.5.0-8-gcbc7f62-wmf4jessie1 from jessie-wikimedia/thirdparty [10:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:31] PROBLEM - Varnish traffic logger - varnishstatsd on cp4025 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [10:44:33] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4025 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [10:44:45] PROBLEM - IPsec on cp4025 is CRITICAL: NRPE: Command check_IPsec not defined [10:44:51] ignore, that's me ^ [10:44:53] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10jbond) 05Open→03Resolved a:03jbond I have uploaded the new package and removed the thirdparty package; let me know if you see any issues ` contint2001 ~... [10:44:59] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1982 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/Varnish [10:45:21] will be fixed in a minute when puppet runs on icinga1001 [10:46:21] (icinga still thinks cp4025 runs varnish as the backend software, ha) [10:46:36] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10hashar) Looks all good. Thank you very much :) [10:47:49] !log pool cp4025 w/ ATS backend T219967 [10:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:53] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [10:50:06] (03PS1) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [10:51:32] (03PS24) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [10:56:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:57:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5153205, @Ottomata wrote: > Hm but also, whatever we replace varnishkafka with...
[10:57:27] (03PS17) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [10:57:29] (03PS4) 10Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [10:57:31] (03PS3) 10Vgutierrez: nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) [10:57:33] (03PS6) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [10:57:35] (03PS29) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:57:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:58:27] looks like a spike, already recovered [10:59:06] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) [10:59:08] (03PS1) 10Jcrespo: mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) [10:59:26] (03PS30) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:59:50] (03PS1) 10Matthias Mullie: SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) [11:00:52] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [11:01:01] (03PS2) 10Jcrespo: mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) [11:01:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:06:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:08:15] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) > I wouldn't spend much time on jessie problems but rather focusing moving to stretch or buster at this point. That kind of contr... 
[11:10:29] <_joe_> !log purging opcache on mw1275 [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:07] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 81126 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:11:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) The trick was to manually set PXE boot for the debian installer and while the installer is working manually switch to disk boot. ` hpiLO... [11:16:00] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) >>! In T222112#5153856, @CDanis wrote: > I think you should just be able to remove the "custom all value" in the dashboard settings and hav... [11:18:29] (03PS1) 10Jcrespo: backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) [11:25:57] (03CR) 10Cparle: [C: 03+1] SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) (owner: 10Matthias Mullie) [11:36:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` and were **ALL** successful. [11:48:56] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) 05Open→03Resolved a:03hashar So my primary concern was having a freshly created WMCS instance to be broken on creation. I th... [11:50:05] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:50:43] 10Operations: cron-spam: /usr/local/sbin/check-cumin-aliases - https://phabricator.wikimedia.org/T222443 (10jbond) p:05Triage→03Normal [11:52:13] (03PS1) 10Jbond: cumin: update list of cumin_masters [puppet] - 10https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) [11:53:59] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:06:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: repool cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) [12:07:19] (03CR) 10Arturo Borrero Gonzalez: "Please andrew +1 (or even +2 and merge :-P)" [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) (owner: 10Arturo Borrero Gonzalez) [12:08:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) [12:08:46] (03PS20) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:08:48] (03PS1) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:09:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10aborrero) [12:09:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) 05Open→03Resolved [12:15:15] (03PS2) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:17:32] (03PS3) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:17:34] (03PS21) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:21:23] !log cp3038: varnish-backend-restart [12:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:35] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [12:26:07] !log replaying 30 minutes of eqiad search traffic on codfw - T221121 [12:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:11] T221121: Capacity planning for elastic search - https://phabricator.wikimedia.org/T221121 [12:28:34] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) No worries at all @hashar ! This upgrade problem was unexpected and rather annoying for sure, my expectation is that subseque... 
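The expiry mailbox lag cleared by the cp3038 backend restart above is the backlog between Varnish's worker threads and its expiry thread. A minimal sketch of a probe along those lines, assuming the lag is simply the difference between the MAIN.exp_mailed and MAIN.exp_received counters reported by `varnishstat -j` (the counter names and the threshold here are assumptions for illustration, not taken from the production check):
```python
#!/usr/bin/env python3
"""Sketch of an expiry-mailbox-lag probe for a Varnish backend."""
import json
import subprocess
import sys

CRITICAL = 2_000_000  # hypothetical threshold; cp3038 alerted above at 2086526


def mailbox_lag() -> int:
    # `varnishstat -j` prints all counters once, as a JSON object keyed by name
    stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received


if __name__ == "__main__":
    lag = mailbox_lag()
    if lag >= CRITICAL:
        print(f"CRITICAL: expiry mailbox lag is {lag}")
        sys.exit(2)
    print(f"OK: expiry mailbox lag is {lag}")
    sys.exit(0)
```
A restart resets both counters, which is consistent with the lag reading 0 immediately after the `varnish-backend-restart` above.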
[12:30:41] (03PS22) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:36:53] (03CR) 10Mathew.onipe: "PCC is ok: https://puppet-compiler.wmflabs.org/compiler1002/16313/" [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:40:30] (03PS1) 10Ema: varnish: retry requests upon 502 errors [puppet] - 10https://gerrit.wikimedia.org/r/507953 (https://phabricator.wikimedia.org/T219967) [12:44:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) @akosiaris and @WMDE-leszek I think the needed changes to proceed with the helm chart are now done. Feel free to poke m... [12:48:03] (03PS2) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:48:57] (03PS3) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:49:48] (03PS4) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:54:26] (03PS5) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:01:28] (03PS6) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:10:17] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [13:13:44] (03CR) 10Mathew.onipe: wdqs: add WDQS restart cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [13:13:57] (03PS2) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [13:22:48] (03PS2) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [13:33:15] PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [13:33:21] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:33:49] PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:34:01] PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:34:03] PROBLEM - Disk space on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:34:05] PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:34:13] PROBLEM - dhclient process on proton1001 is CRITICAL: 
connect to address 10.64.0.20 port 5666: Connection refused [13:34:33] PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:37:01] PROBLEM - puppet last run on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:39:18] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) [13:39:29] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) p:05Triage→03Normal [13:45:23] PROBLEM - Check the NTP synchronisation status of timesyncd on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:46:52] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) Note to self: remember that doing the above breaks all the kafka graphs [13:49:08] !log Restart nrpe on proton1001 [13:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:37] RECOVERY - Check size of conntrack table on proton1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:49:51] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [13:49:53] RECOVERY - Disk space on proton1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:49:55] RECOVERY - configured eth on proton1001 is OK: OK - interfaces up [13:50:03] RECOVERY - dhclient process on proton1001 is OK: PROCS OK: 0 processes with command name dhclient [13:50:25] RECOVERY - DPKG on proton1001 is OK: All packages OK [13:50:27] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [13:50:31] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:50:31] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:52:08] (03PS7) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:52:59] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [13:55:47] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [14:07:25] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) It seems that the following happens when using the default all value (using the current cpu usage query since it ends up in the same proble... [14:11:00] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) This works with the default all value: ` avg by (instance) (irate(node_cpu{cluster="$cluster",mode!="idle",instance=~"($kafka_broker).*"}[...
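All of the proton1001 alerts above share one symptom: Icinga could not reach the NRPE agent on TCP port 5666, so every NRPE-backed check went CRITICAL at once, and a single nrpe restart cleared them together. A minimal sketch of that kind of reachability probe (the host and port come from the alert text; everything else is illustrative, not the actual monitoring plugin):
```python
#!/usr/bin/env python3
"""Sketch: bare TCP reachability probe for an NRPE agent.
When this fails, every NRPE check on the host goes CRITICAL together,
which is the pattern seen for proton1001 above."""
import socket
import sys

HOST = "10.64.0.20"  # proton1001, from the alert text
PORT = 5666          # standard NRPE port


def nrpe_reachable(host: str, port: int, timeout: float = 10.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:  # covers "Connection refused" as well as timeouts
        print(f"connect to address {host} port {port}: {exc.strerror or exc}")
        return False


if __name__ == "__main__":
    sys.exit(0 if nrpe_reachable(HOST, PORT) else 2)
```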
[14:13:03] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Reedy) [14:15:41] RECOVERY - Check the NTP synchronisation status of timesyncd on proton1001 is OK: OK: synced at Fri 2019-05-03 14:15:40 UTC. [14:17:47] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Lucas_Werkmeister_WMDE) That deployment (fixing T222347) was a backport ([I3e4bf4b12d](https://gerrit.wikimedia.org/r/507847)), which I force-submitted because... [14:19:54] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) @Lucas_Werkmeister_WMDE We will look into it, it only happened on a single server so we believe, for now, that it could not be related to the change pe... [14:26:15] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) Swapped all the occurrences of `instance=~"$kafka_broker"` with `instance=~"($kafka_broker).*"`, and the dashboard seems loading faster now... [14:32:28] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) I don't think Magnus would build it into librdkafka itself. But, a 3rd party C library could b... [14:42:34] (03CR) 10Ottomata: "Nit: I think this repo might better belong in operations/software, rather than operations/debs. I think operations/debs exists for packa" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [14:43:10] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10CDanis) 05Open→03Resolved a:03CDanis It does seem //much// faster now, thanks @elukey ! Impact of loading 30 days on Prometheus is also mini... 
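The dashboard fix described above anchors the instance matcher as `instance=~"($kafka_broker).*"` so the template variable no longer needs a custom all value that expands into an expensive unanchored regex. A rough harness for comparing how long two PromQL variants take, written as a sketch against the standard Prometheus HTTP API (`/api/v1/query`); the server URL and the example queries are placeholders, not the actual dashboard expressions:
```python
#!/usr/bin/env python3
"""Sketch: time PromQL variants via the Prometheus HTTP API to compare
the cost of different instance label matchers."""
import time
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder URL

QUERIES = [
    # hypothetical expansions of two matcher styles for one broker
    'avg by (instance) (irate(node_cpu{mode!="idle",instance=~"kafka1001.*"}[5m]))',
    'avg by (instance) (irate(node_cpu{mode!="idle",instance=~"(kafka1001).*"}[5m]))',
]


def timed_query(query: str) -> float:
    start = time.monotonic()
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()  # fail loudly on HTTP or query errors
    return time.monotonic() - start


if __name__ == "__main__":
    for q in QUERIES:
        print(f"{timed_query(q):6.3f}s  {q}")
```
Running each variant a few times against a test Prometheus is a quick way to confirm a dashboard change like the one above before editing every panel.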
[14:45:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:45:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:53:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:55:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:55:43] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Joe) some things from my very initial analysis: - I tried to purge first the directory that the deployment had invalidated, the error didn't go away - I tried p... [14:56:14] <_joe_> !log repool mw1275 [14:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:05] <_joe_> !log repooling the wtp* servers depooled in codfw for load testing [15:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=parsoid,dc=codfw [15:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] any ops member who would be able to help me with https://phabricator.wikimedia.org/T222033 ? [15:29:04] (03PS1) 10ArielGlenn: make page ranges for stubs be ints [dumps] - 10https://gerrit.wikimedia.org/r/507976 [15:29:06] (03PS1) 10ArielGlenn: convert exception error strings to utf8, thanks a lot python3 [dumps] - 10https://gerrit.wikimedia.org/r/507977 [15:31:18] _joe_: Any interest in an apache change for revi? :P [15:31:35] well, realized not right now because of some OS work going on [15:33:06] <_joe_> Reedy: on friday? [15:33:17] If you're feeling brave [15:33:22] <_joe_> you were trolling right [15:33:24] If not, scheduling to do it next week :) [15:33:28] <_joe_> no that would be feeling stupid [15:33:47] <_joe_> apache on friday, never [15:33:49] <_joe_> did once [15:33:53] <_joe_> ruined my weekend [15:34:21] at least gimme some timeframe :P [15:48:25] _joe_: can we at least agree that we can (definitely) do it next week? :P [15:48:46] <_joe_> revi: what change?
[15:48:51] <_joe_> and sure next week [15:48:52] https://phabricator.wikimedia.org/T222033 [15:49:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/506895 [15:53:26] (03CR) 10Alex Monk: [C: 04-1] elastalert: new module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [15:54:07] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:55:27] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:55:34] <_joe_> uh, utf-8 [16:00:44] buuuu memcached [16:00:54] 10Operations, 10media-storage, 10observability: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (10Dzahn) p:05Triage→03Normal [16:05:29] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) Yup looks all right so far :-] [16:05:39] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:14] anybody working on cp2009? [16:06:29] (03CR) 10Dzahn: "yea, that's right. there used to be the redirects.conf and redirects.dat and you had to upload both but that changed and is not the case anymore" [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) (owner: 10Revi) [16:06:30] CC: ema --^ [16:07:34] (03CR) 10Hashar: "And that is apparently a noop as far as CI/zuul/jenkins-bot are concerned :]" [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [16:07:51] elukey: i tried to get on mgmt and interesting response: [16:07:51] /usr/bin/clpd: Input/output error [16:08:05] very nice from cp2009 [16:08:15] elukey: I'm not working on it, no. It's part of the ATS test cluster though, so not serving prod traffic [16:08:23] ack then :) [16:08:25] ty [16:08:46] should we make a ticket for the mgmt of it? [16:09:03] might need DRAC reset or something [16:09:15] (03PS3) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [16:09:28] (03CR) 10Hashar: [C: 03+1] puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) (owner: 10Dzahn) [16:11:09] when we have alerts for hosts that are part of test clusters it makes me wonder whether we should have puppet code that skips icinga if defined as test cluster [16:11:39] but then.. you might want to know when test hosts have issues too.. for the purpose of testing [16:12:09] mutante: to be fair, that *was* a prod cluster and is now a 'test' cluster simply because no frontends are routed to those hosts [16:12:21] they're gonna become actually test hosts once I find the time to reimage them :) [16:12:53] ACKNOWLEDGEMENT - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn not serving prod traffic [16:12:57] ema: ah!
alright, got it [16:13:48] (03CR) 10Alex Monk: "Needs rebasing, there is now a object_replicator_concurrency parameter to swift::storage which doesn't necessarily fit in well with this?" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:16:50] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:17:10] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:22:49] (03PS1) 10Paladox: Merge tag 'v2.16.8' into HEAD [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507987 [16:24:15] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:24:40] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:26:17] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) And yea.. why did the host die? Is it possible to reboot ? [16:28:11] 10Operations, 10ops-codfw, 10Traffic: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) p:05Triage→03Normal [16:36:09] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10HMarcus) Hi all, While in the process of swapping out our old equipment with new DC hardware, I noticed there are several spare SSDs with you guys as well. Is there any chance we can... [16:36:54] (03CR) 10Paladox: [V: 03+2 C: 03+2] Merge tag 'v2.16.8' into HEAD [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507987 (owner: 10Paladox) [16:40:30] (03PS1) 10Paladox: Remove plugin quota [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 [16:41:28] (03CR) 10Paladox: [V: 03+2 C: 03+2] Remove plugin quota [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 (owner: 10Paladox) [16:53:35] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10ayounsi) [16:53:51] (03PS1) 10Paladox: Update plugins for stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507991 [16:54:11] 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) a:03aborrero I'm taking a look. 
[16:54:30] 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) [17:07:26] !log T222148 drop udev from openstack-mitaka-jessie/jessie-wikimedia (related to T216497) [17:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 [17:07:32] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [17:07:39] (03PS2) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 [17:08:19] !log T222148 drop libudev1 from openstack-mitaka-jessie/jessie-wikimedia (related to T216497) [17:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:18] !log T222148 aborrero@labtestpuppetmaster2001:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:22] (03CR) 10CRusnov: "Thanks for the review as always :)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [17:10:57] !log T222148 aborrero@labpuppetmaster1001:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:45] !log T222148 aborrero@labpuppetmaster1002:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] (03CR) 10CRusnov: "> Patch Set 1:" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [17:13:39] (03PS4) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [17:15:46] !log T222148 aborrero@labstore1004:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:50] T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 [17:16:43] !log T222148 aborrero@labstore1005:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:29] 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) 05Open→03Resolved thanks @MoritzMuehlenhoff for the heads up. All seems fixed now. 
[17:19:52] (03PS13) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:23:22] (03PS14) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:23:49] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:28:19] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (10aborrero) 05Open→03Stalled I moved this task to the `Graveyard` column in our kanban board because there are no ac... [17:28:22] (03CR) 10CRusnov: "Puppet compiler looks good." [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:29:30] (03CR) 10Andrew Bogott: [C: 03+2] openstack: eqiad1: repool cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) (owner: 10Arturo Borrero Gonzalez) [17:29:42] (03CR) 10CRusnov: Add a check_netbox_report icinga check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:29:57] (03PS15) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:32:34] (03CR) 10Rush: [C: 03+1] "Let's ask Mortiz to eyeball this if he has a moment but I believe this is OK." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [17:33:24] (03PS1) 10Andrew Bogott: Enable alerts for cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507997 [17:33:31] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Enable alerts for cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507997 (owner: 10Andrew Bogott) [17:34:45] (03CR) 10Dzahn: [C: 03+1] "i see nothing wrong with the usage of nrpe::monitor_service and the rest of the puppet/hiera part if it compiles, and it does, then that l" [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:35:34] win 5 [17:35:39] is there a generic 'upgrade away from jessie' task somewhere? [17:38:18] i am aware of one to "track remaining trusty servers" and several individual tasks to upgrade things to stretch.. 
but not of a "tracking" task to link all those [17:38:51] i think after trusty is killed that will trigger a "remaining jessie" one [17:39:33] ok [17:39:39] Krenair: haha, well https://phabricator.wikimedia.org/T168494 [17:39:47] (03CR) 10CRusnov: Add a check_netbox_report icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:39:48] that question seemed familiar [17:40:13] (03CR) 10CRusnov: [C: 03+2] Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:40:22] closed as invalid when i suggested it once [17:40:40] well it has been a couple of years since then [17:40:44] I'm looking at my list of jessie stuff [17:40:45] haha "removed a project: Tracking-Neverending." [17:41:03] I think that project got renamed since then FWIW mutante, it might not have said neverending at the time [17:41:14] oh.. yea. likely [17:41:35] well, feel free to reopen or make a new one and we will see [17:42:14] in particular I'm wondering about the irc server [17:42:20] it went precise -> jessie, skipping trusty [17:42:33] so I'm wondering if it'll skip stretch and go straight to buster [17:43:40] (03PS2) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:44:29] i don't know. i dont think that the past skip is an indication for future upgrade [17:44:36] (03CR) 10jerkins-bot: [V: 04-1] [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 (owner: 10Jforrester) [17:46:11] (03PS3) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:47:04] (03CR) 10jerkins-bot: [V: 04-1] [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 (owner: 10Jforrester) [17:47:31] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494 (10Krenair) I'm wondering if now it is time to reopen this as jessie -> stretch/buster and track misc hosts at least... From the looks of things the following misc hosts are jessie: `actinium alcyone alsafi aluminium darms... [17:48:22] (03PS3) 10Dzahn: puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) [17:49:19] (03CR) 10Dzahn: [C: 03+2] puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) (owner: 10Dzahn) [17:49:30] (03PS4) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:50:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [17:50:19] 10Operations, 10netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (10ayounsi) We don't monitor our IX peers much. 
We probably should configure BGP route damping, https://www.juniper.net/documentation/en_US/junos/topics/usage-guidelines/policy-using-routing... [17:53:49] at first glance in that list I'm seeing stuff relating to mailman, otrs, ldap, irc, gerrit [17:54:16] (03PS2) 10Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507727 [17:54:18] (03PS2) 10Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 [17:54:20] (03PS2) 10Jforrester: [WIP] writeToStaticCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 [17:56:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] writeToStaticCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (owner: 10Jforrester) [17:57:29] (03PS1) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:00:15] mutante, know anything about schleifenbauer? [18:00:41] Krenair: that's the company making the PSUs (used in esams) [18:00:50] ah [18:00:50] power supplies + stuff [18:01:34] (03PS2) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:02:30] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:04:30] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [18:06:18] (03PS3) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:09:25] 10Operations, 10puppet-compiler, 10Jenkins, 10Patch-For-Review: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10Dzahn) cron deployed on compiler1002 and tested to run manually. works. removed about 2G (from 44G to 42G size of the output dir). is there mo... [18:09:32] (03CR) 10CRusnov: "Compiles to expected output." 
[puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:11:59] (03CR) 10Dzahn: profile::netbox: Deploy config for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:15:45] (03CR) 10Dzahn: [C: 03+1] profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:18:54] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Deploy config for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:19:09] (03PS4) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:20:59] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) a:05Papaul→03RobH @RobH you mentioned that the port was disabled, but it looks like it is not https://librenms.wikimedia.org/alerts/ [18:22:11] (03PS6) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [18:23:44] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.12, 25.14, 18.06 [18:24:38] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) a:05RobH→03Papaul @papaul: Please trace and disable the port, as it is unclear on the stack which port it was. [18:26:22] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.77, 22.15, 18.33 [18:29:00] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:15] ^ that's me, working on it [18:33:12] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16328/" [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:33:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:33:59] (03PS2) 10Dzahn: kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) [18:40:41] (03PS1) 10CRusnov: profile::netbox: Remove explicit directory creation causing failure [puppet] - 10https://gerrit.wikimedia.org/r/508002 [18:40:50] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:42:36] (03CR) 10Dzahn: [C: 03+1] "yep, per our IRC talk the conflict comes from the uwsgi class creating /etc/${title} and 'netbox' is the title.
Though an existing "!defin" [puppet] - 10https://gerrit.wikimedia.org/r/508002 (owner: 10CRusnov) [18:43:32] (03CR) 10Dzahn: "noop on analytics1030 and mwlog1001 - not even a compiler change on an-coord" [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:44:10] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Remove explicit directory creation causing failure [puppet] - 10https://gerrit.wikimedia.org/r/508002 (owner: 10CRusnov) [18:45:56] (03CR) 10Dzahn: [C: 03+1] LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:46:01] (03CR) 10CDanis: [C: 03+1] admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn) [18:50:12] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:51:26] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:02:53] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10Dzahn) p:05Triage→03Normal [19:08:36] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-5/0/15 descriptions Interface Admin Link Description ge-5/0/15 down down DI... [19:20:05] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:20:09] (03PS1) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [19:20:18] (03PS2) 10Ayounsi: LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) [19:20:49] (03CR) 10jerkins-bot: [V: 04-1] vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:21:09] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Gaurav) [19:23:28] (03PS2) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [19:26:09] (03PS1) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:26:41] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:27:38] (03PS2) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:28:09] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:29:06] (03CR) 10Eevans: [C: 03+1] "Tested locally; Works for me. Thanks!" 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [19:30:19] (03PS3) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:30:22] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/16330/app-editor-tasks.mobile.eqiad.wmflabs/change.app-editor-tasks.mobile.eqiad.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:30:56] (03CR) 10Effie Mouzeli: [C: 03+1] admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn) [19:31:14] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:32:22] (03PS4) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:32:55] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:33:47] (03PS5) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:37:53] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10ayounsi) [19:40:16] (03PS1) 10Ayounsi: LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) [19:41:11] (03PS2) 10Ayounsi: LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) [19:42:11] (03PS6) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:43:24] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/16335/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:44:12] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:47:57] (03PS3) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [19:48:09] (03PS4) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [19:49:35] (03CR) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 (owner: 10Jforrester) [19:49:39] (03PS5) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [20:09:11] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Dzahn) a:03Dzahn [20:17:58] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Dzahn) 05Open→03Resolved Hello Gaurav , Done. I see you guys are already running the Wikimedians in Colorado list, so i assume you are familiar and i will keep it short.... 
[20:26:37] Operations, ops-eqiad, Gerrit, serviceops, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) a:Dzahn
[20:39:56] Operations, Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Dzahn) p:Triage→Normal
[20:40:33] Operations, netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (Dzahn) p:Triage→Normal
[20:40:37] Operations, netops, observability: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (Dzahn)
[22:09:28] !log clear v4 BGP to AS17451 on cr1-eqsin/cr4-ulsfo
[22:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:02] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:16:02] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:16:40] XioNoX
[22:16:58] thx
[22:17:22] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:22] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:30] seems like it was unplanned
[22:17:34] woop
[22:27:20] cr1-eqiad and cr2-eqdfw might alert as well
[22:27:53] Seems like GTT is having issues
[22:49:21] Operations, Puppet, Cloud-Services, Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (Dzahn)
[23:48:43] threads have been piling up all day in gerrit. Seems to be getting worse instead of better. I think I'm going to give gerrit a restart before I walk away for the weekend.
[23:49:37] !log gerrit restart due to threads piling up
[23:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:48] any indication as to the cause?
[23:50:06] not yet
[23:50:51] !log gerrit back
[23:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:02] chaomodus: I've been keeping notes as I go on https://phabricator.wikimedia.org/T221026
[23:51:37] I've been tweaking things here and there, but there has been no silver bullet.
[23:52:11] Mmh
[23:52:23] thread dumps, GC monitoring: everything looks mostly normal afaict (after all the tweaks anyway)
[23:52:37] This is often the java experience (tm); I used to admin Solr and had similar experiences
[23:53:08] though the threads did look a lot better after the java upgrade
[23:53:21] since they go up then come back down now
[23:53:30] not today :(
[23:53:33] instead of just staying stuck (and gradually going up)
[23:53:54] I see g1 has been considered; properly tuned it may have a lot of benefit if full GCs can be avoided
[23:54:07] or has that already been deployed
[23:54:13] chaomodus: we are actually currently using g1 and it has been a big boon
[23:54:18] nice
[23:54:33] (PS1) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066
[23:54:44] there were many concurrent full GCs happening with the parallel GC. Lots of pause time.
[23:54:52] ^^
[23:55:23] well that's something at least
[23:55:31] it's still getting some sort of thread deadlock though?
[23:56:02] it appears threads are going higher than normal
[23:56:30] e.g. going up to 17.0 today
[23:56:31] seemingly. Although I haven't really caught anything in thread dumps.
[23:56:44] (PS2) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066
[23:56:50] well. Nothing since the latest rounds of changes anyway.
[23:57:22] http threads were getting stuck behind a sendemail thread, but there was some java bug that may have sorted that one out
[23:57:51] now it seems like: http threads are all runnable, but there are a lot of them stacking up for no reason I can see
[23:58:07] like the connections aren't timing out or something?
[23:58:38] no timeouts, no locked resources between threads, no change in traffic at that time
[23:58:58] I mean it's not cleaning up completed connections maybe
[23:59:07] idk much about what it uses, tomcat?
[23:59:11] jetty
[23:59:14] ah
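An aside on the G1 exchange above: the older parallel collector falls back to long stop-the-world full collections under heap pressure, while G1 reclaims heap regions incrementally and mostly concurrently, so avoiding full GCs is the point of the switch. The snippet below is an illustrative sketch of enabling G1 on a JVM service with GC logging to verify that full GCs have stopped; the actual flags used on the Gerrit hosts are not shown in this log, and the log path is a placeholder.

# Illustrative only: not the production Gerrit flags.
# Enable G1 and request pause times bounded around 200ms.
JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
# Java 8-style GC logging; grep the resulting log for "Full GC" to
# confirm the full collections have gone away.
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/gerrit/gc.log"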
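The kind of diagnostics referenced in the exchange above (thread dumps plus inspection of Gerrit's work queue) can be gathered roughly as sketched below; <gerrit-pid> and the output path are placeholders, and show-queue requires elevated Gerrit permissions.

# Dump all JVM threads, including lock ownership, to look for stuck or
# deadlocked threads (jstack ships with the JDK).
jstack -l <gerrit-pid> > /tmp/gerrit-threads.txt

# Gerrit's SSH interface can list its internal work queue; -w prints
# task names without truncation.
ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w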