[00:12:24] (PS1) GTirloni: clush: Add sge group [puppet] - https://gerrit.wikimedia.org/r/487314
[00:16:11] (CR) GTirloni: [C: +2] clush: Add sge group [puppet] - https://gerrit.wikimedia.org/r/487314 (owner: GTirloni)
[00:16:29] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:19:45] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling)
[00:31:33] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Agreed. There is zero need for a subdomain: this is why CSS exists and has been supported since at least 1998.
[00:45:02] Puppet, MobileFrontend, Readers-Web-Backlog (Tracking), User-Jdlrobson: Mobile site does not automatically redirect to desktop version (and not possible to use browser "use desktop view") - https://phabricator.wikimedia.org/T60425 (Krenair) See also {T214998}
[00:48:17] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:18:35] gerrit seems really slow?
[01:35:59] very slow for me as well. (everything works, it just takes several seconds)
[01:46:38] Me too
[01:46:44] Happening to others
[01:46:57] * paladox creates a task
[01:47:41] Operations, Gerrit: Gerrit losing slowly - https://phabricator.wikimedia.org/T215004 (Paladox)
[01:47:47] Operations, Gerrit: Gerrit losing slowly - https://phabricator.wikimedia.org/T215004 (Paladox) p:Triage→Unbreak!
[01:49:27] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc
[01:49:37] Cpu and load has gone up
[01:51:09] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (AntiCompositeNumber)
[01:51:16] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (TheDragonFire) I get intermittent `plugin failed to load` errors, as well as this in console: ` Thu Jan 31 01:49:32 GMT+000 2019 Class$82 SEVERE: TypeError: a is null Class$S43: TypeError: a is null at Unknown.TP(htt...
[01:53:39] Anyone able to contact ops / releng?
[01:55:28] ze servers r on fire!!!
[01:56:23] Ops Clinic Duty:
[01:56:28] Well that's discouraging.
[01:57:40] * gtirloni is taking a look
[01:57:57] Thanks gtirloni
[01:58:15] I’m wondering what’s taking that much cpu
[01:59:35] paladox: do you have the task #?
[01:59:45] Yup
[01:59:48] https://phabricator.wikimedia.org/T215004
[02:01:55] !log T215004 restarted gerrit (using 1200% cpu, 71% mem)
[02:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:59] T215004: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004
[02:02:19] Thanks!
[02:03:29] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[02:04:36] paladox: does it look okay now?
[02:04:37] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[02:04:55] gtirloni: yup
[02:04:59] Much better thanks!
[02:05:04] cool, I'll blame java :)
[02:05:23] Lol
[02:05:29] I don't know what's our error budget for gerrit but I'll let someone else more knowledgeable about gerrit investigate the root cause... a restart seems fine for now
[02:05:45] Thanks :)
[02:05:53] thanks for reporting it
[02:05:55] The bug could likely have been fixed already upstream
[02:06:07] But they had such a big refactor hard to tell
[02:06:30] James_F: thanks for reporting it too
[02:06:33] paladox: got it
[02:06:44] :)
[02:06:49] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[02:06:55] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[02:07:02] That should be recovering soon
[02:07:07] Always happens after a restart
[02:07:13] ok
[02:07:25] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[02:08:18] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (GTirloni) It seems the issue is gone after a reboot. I didn't see anything special in /var/log/gerrit/error_log. Thanks for reporting this.
[02:08:34] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (GTirloni) Open→Resolved p:Unbreak!→High a:GTirloni
[02:08:43] Thanks!
[02:29:59] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:31:05] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:33:19] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:33:23] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:33:55] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:41:05] Operations, Cloud-VPS, Discovery-Search, cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (Mathew.onipe) a:Gehel→Mathew.onipe
[02:41:36] Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work): rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (Mathew.onipe)
[03:23:58] (CR) Mathew.onipe: cloudelastic: Add cloudelastic configs (9 comments) [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[03:25:13] (PS3) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[05:43:53] (CR) Sayant Mahato: "Any updates on this project? We do have namespaces with colon like Visarga(https://en.wikipedia.org/wiki/Visarga). Here are some examples-" [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[06:13:04] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Legoktm) > The m. subdomain, as in en.m.wikipedia.org, is annoying. Links shared on social media are randomly mobile or non-mobile, and so des...
[06:46:53] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Alternate proposal: create - [langcode].braille.[project].org - [langcode].embossed.[project].org - [langcode].handheld.[project]...
[06:57:45] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:57:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:57:57] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:00:23] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:00:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:05:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:12:17] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Funnyjokes2019) Commonly we attain confused when we finally hear the exact terms website development service in addition to internet develope...
[07:45:08] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) p:Triage→Unbreak!
[07:45:38] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) Escalated to unbreak now to bring attention to the vandal: https://phabricator.wikimedia.org/p/Funnyjokes2019/ I've never seen vandali...
[08:03:39] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:47] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75630 bytes in 0.112 second response time
[08:09:32] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Ammarpad) p:Unbreak!→Triage The account has been disabled.
[08:12:33] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) >>! In T214998#4919908, @Koavf wrote: > Alternate proposal: create > - [langcode].braille.[project].org I assume this sarcastic...
[08:24:10] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) >>! In T214998#4919947, @tstarling wrote: >>>! In T214998#4919908, @Koavf wrote: >> Alternate proposal: create >> - [langcode].braill...
[08:58:59] PROBLEM - Backup of s7 in codfw on db1115 is CRITICAL: Backup for s7 at codfw taken more than 8 days ago: Most recent backup 2019-01-23 08:35:54
[08:59:59] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) Yeah, very funny. In case it needs to be said, your comment was an attempt to support my proposal by reductio ad absurdum, via a r...
[09:11:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:14:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:15:09] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:34:29] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) My problem is that you are making fun of MobileFrontend, and by implication, the engineers who designed MobileFrontend, who may wel...
[09:36:38] jynus: a cloud user is reporting an issue with EventStream
[09:40:59] the api, the irc?
[09:41:35] (PS1) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[09:43:39] I see high load there: https://grafana.wikimedia.org/d/000000336/eventstreams?refresh=1m&orgId=1&from=1546335808416&to=1548927808416&panelId=1&fullscreen&var-stream=All&var-topic=All&var-scb_host=All
[09:54:31] !log restarting pdfrender on scb1001
[09:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:23] !log restarting pdfrender on scb1002,3,4
[10:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:04] jynus: I asked the person to open a phab ticket
[10:20:20] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) It is ok, will that be next week, e.g. Tuesday?
[10:22:43] !log resetting to defaults innodb consistency options for db2048 T188327
[10:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:47] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327
[10:31:01] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (aborrero) >>! In T209029#4919182, @Cmjohnson wrote: > @aborrero can this server be re-installed.....there is a risk that removing /dev/sda will kill the OS....
[10:34:58] (PS2) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[10:35:44] (CR) jerkins-bot: [V: -1] mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[10:37:57] (PS3) Jcrespo: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478)
[10:42:29] fuzheado in -cloud has said eventstreams is broken
[10:42:31] the example at https://wikitech.wikimedia.org/wiki/EventStreams#Python leads to HTTP 502
[10:42:36] requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://stream.wikimedia.org/v2/stream/recentchange
[10:43:05] ah I see it was mentioned further up the chat log
[10:43:24] if it's 502ing why are no alarms going off?
[10:45:10] GET / gives HTTP 503
[10:53:19] depends if there is even monitoring setup to alert
[11:09:48] so eventstreams is working
[11:09:59] I can curl http://scb1003.eqiad.wmnet:8092/v2/stream/recentchange
[11:10:08] internally
[11:10:35] and lvs endpoints are up
[11:11:45] https://stream.wikimedia.org/?spec works
[11:11:46] PROBLEM - Host cloudvirt1015 is DOWN: PING CRITICAL - Packet loss = 100%
[11:12:55] ???
[11:15:04] I can also curl https://stream.wikimedia.org/v2/stream/recentchange
[11:15:36] but only from the inside
[11:17:33] it could be hitting some limitation (MAX_CONCURRENT_STREAMS == 128)
[11:18:10] ACKNOWLEDGEMENT - Host cloudvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% Arturo Borrero Gonzalez Investigating
[11:21:47] I am going to restart one eventstreams service
[11:22:43] !log restart eventstreams on scb1001
[11:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:42] Krenair: it works now?
[11:24:29] !log restart eventstreams on scb1002,3,4
[11:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:32] !log T215012 reboot cloudvirt1015
[11:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:35] T215012: cloudvirt1015: server down - https://phabricator.wikimedia.org/T215012
[11:26:17] did the reporter file a ticket?
[11:27:09] jynus: I just asked the reporter to open a ticket
[11:27:18] RECOVERY - Host cloudvirt1015 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:29:03] PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[11:30:29] !log T215012 icinga downtime cloudvirt1015 for 4h while investigating issues
[11:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:33] T215012: cloudvirt1015: server down - https://phabricator.wikimedia.org/T215012
[11:32:27] RECOVERY - Backup of s7 in eqiad on db1115 is OK: Backup for s7 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-01-31 10:10:01 from db1116.eqiad.wmnet:3317 (101 GB)
[11:36:53] RECOVERY - ensure kvm processes are running on cloudvirt1015 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64
[11:37:28] (CR) Raz Shuty: [C: +1] "wait for SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: Ladsgroup)
[11:54:03] (PS1) Jcrespo: mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740)
[11:56:09] (CR) Jcrespo: [C: +2] mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740) (owner: Jcrespo)
[11:56:17] (PS2) Jcrespo: mariadb: Grant m5 access to testreduce databases to scandium [puppet] - https://gerrit.wikimedia.org/r/487348 (https://phabricator.wikimedia.org/T214740)
[12:08:14] (PS1) Jcrespo: mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740)
[12:09:01] (PS2) Jcrespo: mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740)
[12:09:38] (CR) Jcrespo: [C: +2] mariadb: Style fixes for scandium grants [puppet] - https://gerrit.wikimedia.org/r/487349 (https://phabricator.wikimedia.org/T214740) (owner: Jcrespo)
[12:12:11] !log apply new grants to m5-master with replication T214740
[12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:15] T214740: Provide access to testreduce* databases on scandium + revoke from ruthenium - https://phabricator.wikimedia.org/T214740
[12:13:15] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[12:17:20] arturo: cloudvirt1019 may be a WIP-to-setup host ?
[12:24:52] jynus: yes, is in hardware limbo
[12:25:11] can I ack the service alert?
[12:25:35] PROCS CRITICAL: 0 processes with regex args 'qemu-system-x86_64' ?
[12:26:16] or downtime it for some days/weeks ?
[12:26:53] it has notification disabled, see T196507 for context
[12:26:54] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507
[12:27:02] feel free to ack anything
[12:27:09] yes, but it shows on unacknowledged alerts
[12:27:13] thanks
[12:28:12] it is not a problem, I am just trying to triage existing alerts
[12:29:30] cool
[12:54:54] !log stop, upgrade and restart db2044
[12:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:49] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:18:25] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[13:18:54] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml stable/zotero [namespace: zotero, clusters: staging]
[13:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:14] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml --version=0.0.1 stable/zotero [namespace: zotero, clusters: staging]
[13:19:15] !log mvolz@deploy1001 scap-helm zotero cluster staging completed
[13:19:15] !log mvolz@deploy1001 scap-helm zotero finished
[13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:52] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad]
[13:31:53] !log mvolz@deploy1001 scap-helm zotero cluster eqiad completed
[13:31:53] !log mvolz@deploy1001 scap-helm zotero finished
[13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:23] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[13:34:25] !log mvolz@deploy1001 scap-helm zotero cluster codfw completed
[13:34:25] !log mvolz@deploy1001 scap-helm zotero finished
[13:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:39] Operations, Citoid, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (Mvolz)
[13:41:02] Operations, Citoid, Patch-For-Review, Services (done), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (Mvolz)
[13:41:07] Operations, Citoid, Core Platform Team Backlog (Watching / External), Services (watching), VisualEditor (Current work): Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 (Mvolz) Open→Resolve...
[13:47:55] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (TheDragonFire) I don't think there is any significant animosity toward any engineers involved in building what was at the time a pragmatic sol...
[14:13:13] (PS4) Marostegui: mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[14:14:20] (CR) Marostegui: [C: +2] mariadb: Rename sdb instance to staging [puppet] - https://gerrit.wikimedia.org/r/487339 (https://phabricator.wikimedia.org/T210478) (owner: Jcrespo)
[14:16:43] (CR) Marostegui: [C: +2] analytics-grants.sql: Remove unused grants [puppet] - https://gerrit.wikimedia.org/r/487059 (https://phabricator.wikimedia.org/T214469) (owner: Marostegui)
[14:16:50] (PS2) Marostegui: analytics-grants.sql: Remove unused grants [puppet] - https://gerrit.wikimedia.org/r/487059 (https://phabricator.wikimedia.org/T214469)
[14:20:51] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:27:21] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[14:39:16] (CR) Thiemo Kreuz (WMDE): [C: +1] "All good, but I don't have the right to do a +2 in this codebase." [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio)
[14:46:50] (CR) Addshore: [C: +1] Fix typo 'neccessary' [puppet] - https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio)
[15:06:02] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Me, Niklas and Aaron had a long chat face...
[15:30:04] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:33:49] (PS1) Mathew.onipe: maps: migrate maps2004 to stretch [puppet] - https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622)
[15:46:54] RECOVERY - Backup of s7 in codfw on db1115 is OK: Backup for s7 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-01-31 10:12:32 from dbstore2001.codfw.wmnet:3317 (101 GB)
[15:50:46] Operations, Gerrit, Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (Paladox) This would have been usage for debugging T215004 , should we triage as high?
[15:51:19] I am about to restart db1117- that will cause some proxies to detect it as down- I have downtimed them
[15:52:06] they should recover when db1117 restarts fully
[15:54:20] Operations, Gerrit: Investigate why icinga did not report high cpu/load - https://phabricator.wikimedia.org/T215033 (Paladox)
[15:54:23] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Koavf) >>! In T214998#4920019, @tstarling wrote: > My problem is that you are making fun of MobileFrontend, and by implication, the engineers...
[15:54:26] Operations, Gerrit: Investigate why icinga did not report high cpu/load - https://phabricator.wikimedia.org/T215033 (Paladox) p:Triage→High
[15:54:53] !log stop, upgrade and restart db1117
[15:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:34] Operations, Gerrit: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox)
[15:57:00] proxies complaining, as expected
[15:57:14] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[15:57:50] monitoring just in case we have a double failure
[15:58:00] as we are on reduced availability right now
[16:00:03] (PS1) Arturo Borrero Gonzalez: cloudvps: cold-migrate: exclude rotated console.log files [puppet] - https://gerrit.wikimedia.org/r/487362
[16:05:21] and proxies are back to normal
[16:07:13] (CR) Andrew Bogott: [C: +1] "Assuming that --exclude handles wildcards like that, this is a definite improvement." [puppet] - https://gerrit.wikimedia.org/r/487362 (owner: Arturo Borrero Gonzalez)
[16:11:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:14:12] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active, AS6461/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:14:16] (CR) Thiemo Kreuz (WMDE): Stop NavPopups gadget conflict with PagePreviews on Wikivoyage (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[16:14:35] (CR) Arturo Borrero Gonzalez: [C: +2] "> Assuming that --exclude handles wildcards like that, this is a" [puppet] - https://gerrit.wikimedia.org/r/487362 (owner: Arturo Borrero Gonzalez)
[16:29:42] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 22, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:32:36] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:59:22] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:02:43] mutante subbu: should we add notifications_enabled:0 to scandium ^until it is setup ?
[17:26:36] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[17:35:44] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: metawiki. [Query snipped]
[17:39:17] (PS1) Arturo Borrero Gonzalez: openstack: reallocate openstack wrapper script [puppet] - https://gerrit.wikimedia.org/r/487367
[17:40:30] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: reallocate openstack wrapper script [puppet] - https://gerrit.wikimedia.org/r/487367 (owner: Arturo Borrero Gonzalez)
[17:44:12] !log running alter table on metawiki.revision_actor_temp, trying to fix TokuDB horrible bugs
[17:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:06] (PS1) Arturo Borrero Gonzalez: wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368
[17:49:41] (CR) jerkins-bot: [V: -1] wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[17:50:14] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1043.04 seconds
[17:54:14] PROBLEM - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[17:57:05] (PS1) Arturo Borrero Gonzalez: cloudvirt1015: disable notifications [puppet] - https://gerrit.wikimedia.org/r/487370 (https://phabricator.wikimedia.org/T215012)
[17:57:40] (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1015: disable notifications [puppet] - https://gerrit.wikimedia.org/r/487370 (https://phabricator.wikimedia.org/T215012) (owner: Arturo Borrero Gonzalez)
[18:00:48] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Arturo Borrero Gonzalez Server is depooled due to T215012
[18:01:48] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:29:02] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:29:20] Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:41:26] Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:45:41] Operations, MediaWiki-Shell, Wikimedia-General-or-Unknown, Security: Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (D3r1ck01)
[18:55:40] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[18:59:11] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (stjn) If you add some redirection, add a way to circumvent it please (via some specific query?). Right now testing for mobile version involves...
[19:04:18] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 295.66 seconds
[19:22:52] (CR) GTirloni: [C: +1] wmcs: prefix all of our customs scripts with 'wmcs-' [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[19:28:05] (CR) D3r1ck01: "Thanks! :)" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[19:29:25] (PS3) D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878)
[19:40:01] (CR) Thiemo Kreuz (WMDE): [C: +1] Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01)
[19:46:03] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (pmiazga) @Jhernandez I'll update documentation once all Tgr questions are answered.
[20:03:37] Operations, Gerrit: Gerrit loading slowly - https://phabricator.wikimedia.org/T215004 (greg) Thanks @GTirloni
[20:11:34] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (tstarling) >>! In T214998#4920810, @stjn wrote: > If you add some redirection, add a way to circumvent it please (via some specific query?). R...
[20:29:39] Operations, MobileFrontend, Traffic: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (stjn) Wasn’t aware of this link, thanks for addressing my concern. (‘Mobile view’ link sets cookies that make mobile version permanent for a v...
[20:40:44] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Peachey88)
[20:41:57] (PS1) CRusnov: management.py: trivial change to adapt for Netbox 2.5+ [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524)
[21:21:39] Operations, Wikibase-Containers, Wikidata, serviceops, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (Ladsgroup) I would be in favor of not using nginx and turning WDQS gui to a proper npm package so we can have real deployment pipeline (right...
[21:29:01] Operations: dbtree.wikimedia.org down - https://phabricator.wikimedia.org/T215040 (Marostegui)
[21:34:16] (CR) Marostegui: [C: +1] "Although we the module isn't used anymore on core or misc I have run the PPC against a core host, misc, labs, sanitarium, dbstore...just i" [puppet] - https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) (owner: Ottomata)
[21:45:16] (CR) Volans: [V: +2 C: +2] "LGTM" [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[21:59:28] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:02:30] (CR) CRusnov: [V: +2 C: +2] management.py: trivial change to adapt for Netbox 2.5+ [software/netbox-reports] - https://gerrit.wikimedia.org/r/487380 (https://phabricator.wikimedia.org/T212524) (owner: CRusnov)
[22:07:30] I'm looking for a sysadmin to supervise a rename of an account with more than 100,000 edits
[22:07:36] anyone here who can help me?
[22:09:04] most of the foundation is in SF
[22:09:51] 2PM there so they're probably doing stuff, maybe someone is watching IRC and can watch that, idk
[22:12:08] then I guess we could use more European sysadmins...
[22:12:25] you misunderstand
[22:12:27] it is night at Europe
[22:12:29] this week was all-hands
[22:12:53] the annual thing that has them all fly to SF
[22:15:31] what is all-hands?
[22:15:54] but anyway... no sysadmin around? :(
[22:16:50] Nope, they are all busy doing all hands stuff.
[22:17:20] hmm, then I guess it shouldd wait
[22:17:22] -d
[22:17:44] is there an email address on which I can reach them?
[22:17:54] Trijnstel: you could open a phabricator task
[22:18:21] ah, and what or who could I tag?
[22:18:26] for emergencies, i'm sure folks would get to work at fixing them, but i hope this is not an emergency
[22:18:44] (i'm also at all hands, taking a break from the activities :) but not a sysadmin)
[22:18:51] it's not an emergency, but I prefer to have this done on irc - live with them
[22:19:33] Trijnstel: on monday everything should be back to normal
[22:19:41] hmm k
[22:19:53] wrong timing I guess :p
[22:26:58] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[22:46:16] (CR) Marostegui: [C: +1] "We also need to remove the GRANTs from MySQL itself" [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[22:52:40] Operations: dbtree.wikimedia.org down - https://phabricator.wikimedia.org/T215040 (Marostegui) Open→Resolved From what I can see it was failing on the call to: google.setOnLoadCallback(drawChart); It is now working, so it might have been a punctual issue.
[22:54:35] Operations, Analytics, Product-Analytics, User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (elukey) Quick idea that could alleviate this issue: if we create a cgroup like https://wiki.archlinux.org/index.php/cgroups#Matlab on the stat/notebook...
[23:11:51] Operations, MediaWiki-Cache, Performance-Team, Patch-For-Review, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (elukey) @Andrew if you have time can we chat about the labswiki steps stated above?
[23:32:51] (PS1) Elukey: Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453
[23:40:30] (CR) Elukey: [C: -1] "Still WIP" [puppet] - https://gerrit.wikimedia.org/r/487453 (owner: Elukey)
[23:53:13] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Aklapper) Please include context. Is this related to T215004?
[23:56:30] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Paladox) Yup, it's related to that.
[23:59:58] (PS2) Elukey: Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453