[00:08:40] (CR) Vgutierrez: [C: +2] admin: Add alternative key for phuedx [puppet] - https://gerrit.wikimedia.org/r/486995 (owner: Phuedx) [00:10:15] phuedx: you got it :D [00:13:11] Operations, ops-codfw, Traffic: cp2014 host down - https://phabricator.wikimedia.org/T214872 (Vgutierrez) Open→Resolved a:Vgutierrez everything got back to normal after a reboot [00:13:36] (PS1) Marostegui: analytics-dbstore.sql: Initial research user role [puppet] - https://gerrit.wikimedia.org/r/487000 (https://phabricator.wikimedia.org/T214469) [00:15:53] (CR) Marostegui: [C: -1] "Testing required - this is a draft" [puppet] - https://gerrit.wikimedia.org/r/487000 (https://phabricator.wikimedia.org/T214469) (owner: Marostegui) [00:34:16] (PS2) Marostegui: analytics-dbstore.sql: Initial research user role [puppet] - https://gerrit.wikimedia.org/r/487000 (https://phabricator.wikimedia.org/T214469) [00:37:48] (PS1) Marostegui: analytics-grants.sql: Remove non used grants [puppet] - https://gerrit.wikimedia.org/r/487002 [00:38:21] (CR) Marostegui: [C: +2] analytics-grants.sql: Remove non used grants [puppet] - https://gerrit.wikimedia.org/r/487002 (owner: Marostegui) [00:48:44] Operations, Analytics, Research-management, User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (elukey) Notes taken from a chat between me, Nuria, Miriam and Erik at the all hands: * ROCm is the opensource version of AMD drivers that are shipped by AMD itself.... 
[01:49:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:50:37] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:53:15] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:53:28] those two are most likely due to an eqiad-esams link going down [01:53:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:48:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:49:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:50:55] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:54:05] Request from 95.222.239.25 via cp3040 cp3040, Varnish XID 1048810462 Error: 503, Backend fetch failed 
at Tue, 29 Jan 2019 02:53:48 GMT [02:55:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:57:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:47] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:03:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:19:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:20:03] PROBLEM - Host lvs3001.mgmt is DOWN: CRITICAL - Time to live exceeded (10.21.0.147) [03:22:15] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:22:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:23:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[03:23:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:24:06] cp3040 cp3040, Varnish XID 40234074 Error: 503, Backend fetch failed at Tue, 29 Jan 2019 03:23:20 GMT [03:24:28] ACKNOWLEDGEMENT - MD RAID on cp3030 is CRITICAL: connect to address 10.20.0.165 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214879 [03:24:32] Operations, ops-esams: Degraded RAID on cp3030 - https://phabricator.wikimedia.org/T214879 (ops-monitoring-bot) [03:25:33] RECOVERY - Host lvs3001.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 82.94 ms [03:26:54] cp3032 cp3032, Varnish XID 917996101 Error: 503, Backend fetch failed at Tue, 29 Jan 2019 03:26:44 GMT [03:27:31] cp3042 cp3042, Varnish XID 166716513 [03:27:31] Error: 503, Backend fetch failed at Tue, 29 Jan 2019 03:27:22 GMT [03:27:44] looking [03:27:53] Better! 
[03:28:01] I was just going to ping you [03:28:55] one esams-eqiad link is flapping [03:28:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:29:06] thanks :) (cp3041 cp3041, Varnish XID 987792204 Error: 503, Backend fetch failed at Tue, 29 Jan 2019 03:28:42 GMT) [03:29:23] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:30:14] could probably be related to the cp3032 cache error [03:32:27] !log bump cr2-esams-cr2-eqiad ospf cost to 2000 for level3 link flapping [03:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:31] so that link can still be flapping and having issues, but its ospf cost has been bumped so it's not preferred anymore, only there as backup [03:33:35] I'll monitor it [03:33:52] but don't hesitate to ping/page me [03:33:57] alright then. Thanks! [03:34:03] thank you [03:35:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:38:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:39:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:39:39] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:42:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:46:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:51:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:52:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:55:23] (PS4) Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [03:59:09] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces 
up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:01:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:40:23] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (User100100) I think it's best to do nothing and close this task. [05:45:00] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (User100100) It's better leave database as it is. Problem is solved. So this bug should be closed. [06:17:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:21:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:21:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:11] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:57:37] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:19:02] (PS1) D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [07:22:15] (CR) D3r1ck01: "@Jdlrobson, if I remember correct per T203981, we have this happening only on enwikivoyage right? As page previews has been enabled only t" [mediawiki-config] - https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) (owner: D3r1ck01) [07:56:50] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (jcrespo) Open→Resolved a:jcrespo [07:58:45] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage, User-revi: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (jcrespo) a:jcrespo→revi [08:02:55] (CR) Jcrespo: [V: +2 C: +2] mariadb: Update list of core tables and its primary keys [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo) [08:13:28] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/486525 (owner: Jcrespo) [08:14:36] (Merged) jenkins-bot: Revert "mariadb: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/486525 (owner: Jcrespo) [08:17:06] (CR) jenkins-bot: Revert "mariadb: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/486525 (owner: Jcrespo) [08:17:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1114 after crash (duration: 00m 52s) [08:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: 
host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:13] I think that was the expected carrier(s) maintenance [08:24:51] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) db1114 is repooled. 
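The 03:32 !log entry earlier in this log raises the OSPF cost of the cr2-esams to cr2-eqiad link to 2000 so that the flapping link is no longer preferred but stays available as a backup. A minimal sketch of that idea follows, assuming a toy three-node topology: the node name "alt-router" and the costs 100 and 500 are hypothetical and not taken from the log; only the value 2000 is. It runs a plain Dijkstra shortest-path search, which is the kind of lowest-total-cost selection OSPF performs, and is not a model of the real Wikimedia network.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: cost}}; returns (total_cost, path) or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


def make_graph(direct_cost, alt_up=True):
    """Build a hypothetical topology: one direct link plus an optional detour."""
    links = [("cr2-esams", "cr2-eqiad", direct_cost)]
    if alt_up:
        # "alt-router" and the cost 500 are illustrative assumptions.
        links += [("cr2-esams", "alt-router", 500), ("alt-router", "cr2-eqiad", 500)]
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


before = shortest_path(make_graph(100), "cr2-esams", "cr2-eqiad")
after = shortest_path(make_graph(2000), "cr2-esams", "cr2-eqiad")
backup = shortest_path(make_graph(2000, alt_up=False), "cr2-esams", "cr2-eqiad")
print(before, after, backup)
```

With the assumed costs, the direct link wins at cost 100, loses to the detour once it costs 2000, and is still used when the detour is unavailable, which matches the "only there as backup" remark.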
[08:38:44] !log stop, upgrade and restart db2056 [08:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:01] (PS3) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) [08:52:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:55:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:55:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:57:39] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:58:39] !log stop, upgrade and restart db2041 [08:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:59:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:59:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:12:40] !log stopping, upgrading and restarting db2035, this will cause lag on codfw-s2 [09:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:43] !log stop, upgrade and restart db2058 [09:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:32:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:38:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:43:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:46:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:47:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:52:42] mmm ^^^ ?? [09:59:52] (CR) Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/14493/" [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [10:05:59] !log stop, upgrade and restart db2065 [10:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:10] arturo: there was network maintenance, but it should have stopped 15 minutes ago [10:15:45] ack [10:16:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:53] ^either it got extended, or our checks have some lag [10:21:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:46:00] (PS4) Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) [11:48:36] (CR) Lucas Werkmeister (WMDE): "Rebased on top of master so that this isn’t blocked on I8f6a6d67d7." [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: Lucas Werkmeister (WMDE)) [11:50:28] (PS5) Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) [11:51:03] (CR) Addshore: [C: +1] Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946) (owner: Lucas Werkmeister (WMDE)) [11:56:38] (PS1) MarcoAurelio: WIP: Fix typo 'occured' on some files [puppet] - https://gerrit.wikimedia.org/r/487018 [11:57:26] (PS4) Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - https://gerrit.wikimedia.org/r/477521 [11:57:36] (CR) Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477521 (owner: Lucas Werkmeister (WMDE)) [11:58:23] (PS2) MarcoAurelio: Fix typo 'occured' on some files [puppet] - https://gerrit.wikimedia.org/r/487018 (https://phabricator.wikimedia.org/T201491) [12:05:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:05:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:00:45] Hi! The Wikidocumentaries project requires more RAM for the VPS to deal with the needs of Wikibase. [13:04:47] ^ arturo [13:05:17] susannaanas: please, let's use the #wikimedia-cloud channel for discussing this? [13:07:27] OK, learning! [13:11:44] thanks jynus [13:20:10] (CR) Arturo Borrero Gonzalez: [C: +2] Fix typo 'occured' on some files [puppet] - https://gerrit.wikimedia.org/r/487018 (https://phabricator.wikimedia.org/T201491) (owner: MarcoAurelio) [13:26:50] * apergos peers blurry-eyed into the channel for a moment and then wanders off again [14:34:54] Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (Mathew.onipe) p:Triage→Normal [14:36:49] Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (Mathew.onipe) [14:57:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:01:38] ^could this explain weird network issues with wikis and WMCS? [15:02:09] no, that is a different datacenter [15:02:18] which issues are you suffering? [15:03:30] not sure they're related, things are slow on toolforge, commons timeouts [15:04:18] "commons timeouts" is a bit generic- what are you trying to reach, curl https://commons.wikimedia.org ? [15:06:05] or the databases? or something else? 
[15:06:08] `ping 208.80.155.163 ` 38% loss [15:06:20] commons was general web access [15:06:29] pings icmp is highly filtered [15:06:46] but yeah, curl is still not responding to curl from my notebook as well [15:06:54] *commons is still [15:07:25] that ip is 208.80.155.163, I think it should respond [15:07:49] might be something on my end, but everything else works [15:09:32] https://www.irccloud.com/pastebin/jkGVu3vk/ [15:09:40] ^it just hangs after that [15:10:20] `curl https://pt.wikipedia.org` works fine [15:10:50] I think you have an issue with your dns, commons should not try to connect to that ip [15:12:12] * chicocvenancio checks dns stuff [15:12:31] can you resolve for me commons.... on your side? [15:12:44] it should resolve to the same ip as pt.wiki [15:13:00] (I don't know which one because it depends on your location) [15:13:18] you are getting a cloud ip [15:13:40] so maybe you set up some vpn/ssh tunnel/bridge/ etc [15:14:06] https://www.irccloud.com/pastebin/yJS3wPvQ/ [15:15:20] it is the same as ptwiki [15:15:29] https://www.irccloud.com/pastebin/gnzSlS0p/ [15:17:35] (PS1) Mathew.onipe: admin: create new ldap groups for cloudelastic nodes [puppet] - https://gerrit.wikimedia.org/r/487040 [15:17:51] jynus: huh, change the dns to 8.8.8.8 and it now works [15:18:05] I think for you, you should get 208.80.154.224 [15:18:39] not that one, that one resolves to me to a host that is ours but doesn't serve that traffic [15:19:05] it could not be you, it could be your ISP [15:19:28] seems like it is, but that was the ip all along [15:19:36] :) [15:20:21] now that I have access to commons I'll bug the WMCS team over in their channel, thanks jynus [15:32:07] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:32:30] (CR) Gehel: [C: -1] "This change is ready for review." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/487040 (owner: Mathew.onipe) [15:34:45] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time [15:38:25] Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (Mathew.onipe) [15:42:08] Operations, Cloud-VPS, SRE-Access-Requests, cloud-services-team, Discovery-Search (Current work): Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (Mathew.onipe) p:Triage→Normal [15:42:43] (PS2) Mathew.onipe: admin: create new system groups for cloudelastic nodes [puppet] - https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) [15:42:55] (CR) Mathew.onipe: elasticsearch_cluster: fix issues from test result (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe) [15:42:57] (CR) Gehel: [C: -1] elasticsearch_cluster: fix issues from test result (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe) [15:43:51] (CR) Mathew.onipe: admin: create new system groups for cloudelastic nodes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: Mathew.onipe) [15:46:12] (PS4) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) [15:46:28] (CR) Mathew.onipe: "Thanks for this!" 
(1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe) [15:46:34] (CR) Gehel: [C: -1] "minor issues inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [15:47:34] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: Mathew.onipe) [15:48:32] (CR) Gehel: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe) [16:02:43] (PS5) Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [16:03:18] (CR) Mathew.onipe: icinga: enable check for psi and omega clusters (2 comments) [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [16:03:49] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time [16:04:39] PROBLEM - Nginx local proxy to jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.010 second response time [16:05:07] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.005 second response time [16:05:57] RECOVERY - Nginx local proxy to jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.024 second response time [16:07:08] (CR) Mathew.onipe: "PCC Output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/14494/" [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe) [16:31:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:37:31] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:51:02] !log T214499 update Netbox status for cloudvirt1023/1024/1025/1026/1027 from PLANNED to ACTIVE. These servers are actually providing services already. [16:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:06] T214499: wmcs: refresh hardware tracking docs - https://phabricator.wikimedia.org/T214499 [17:03:18] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) @aaron, @Nikerabbit - if you guys have ti... [17:08:29] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:24:23] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:36:53] (CR) Gehel: [C: -1] "Hopefully, that's the last comment!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [17:49:12] (03PS1) 10Marostegui: analytics-grants.sql: Remove unused grants [puppet] - 10https://gerrit.wikimedia.org/r/487059 (https://phabricator.wikimedia.org/T214469) [17:52:13] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (10kchapman) 05Open→03Declined [17:52:52] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729 (10kchapman) 05Stalled→03Declined [17:52:59] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (10Krinkle) >>! In T189741#4061010, @MoritzMuehlenhoff wrote: > I have created https://gerrit.wikimedia.org/r/operations/debs/python-aiokafka, could you import your package there... [17:53:18] 10Operations, 10WMF-Legal, 10Graphite, 10Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10Aklapper) [17:53:54] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (10Paladox) That's a clone link. So the UI will report a 404 or not found. [17:54:10] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (10Krinkle) I've marked the repo read-only, and updated its description to be "archived" pointing to this task. [17:56:36] 10Operations, 10WMF-Legal, 10Graphite, 10Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10Krenair) this seems relevant: https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights it sounds to me like the argument that this would not be copyrighte... 
[17:58:56] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Aklapper) >>! In T197624#4913247, @Dzahn wrote: > So i just got 250 surprise notifications over night and it means it's hard to see the actual notifications i would like to read. @Dzahn:... [18:28:33] 10Operations, 10WMF-Legal, 10Graphite, 10Performance-Team (Radar), 10Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (10Krinkle) [18:32:03] (03PS6) 10Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) [18:32:29] (03CR) 10Mathew.onipe: icinga: enable check for psi and omega clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [18:35:30] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10TJones) [18:37:13] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10debt) [18:37:32] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 3 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (10debt) [18:38:08] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089 (10TJones) [18:39:28] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - 
https://phabricator.wikimedia.org/T145065 (10TJones) 05Open→03Resolved [18:40:36] (03PS1) 10Tim Starling: Use excimer to set a graceful wall clock time limit in PHP 7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 [19:23:51] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog: Fix node vs nodejs dependency issue - https://phabricator.wikimedia.org/T214153 (10TJones) [19:25:19] (03PS1) 10Ladsgroup: Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) [19:27:05] 10Operations, 10Discovery-Search, 10Elasticsearch, 10monitoring: Elasticsearch health check for shards icinga check shows OK status when cluster health is yellow - https://phabricator.wikimedia.org/T210668 (10Gehel) 05Open→03Resolved a:03Gehel We have specific checks for things we actually care about... [19:34:34] (03PS6) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) [19:34:41] (03PS7) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039) [19:39:59] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [19:40:11] PROBLEM - Nginx local proxy to videoscaler on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.013 second response time [19:40:15] PROBLEM - Nginx local proxy to jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [19:41:17] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.015 second response time [19:41:31] RECOVERY - Nginx local proxy to videoscaler on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 288 
bytes in 0.013 second response time [19:41:35] RECOVERY - Nginx local proxy to jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 287 bytes in 0.027 second response time [19:54:07] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:54:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:55:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:56:43] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:06:28] 10Operations: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (10jcrespo) Tendril is WIP {F28074537} [20:52:47] Centurylink ticket 15816094 opened for the esams-eqiad link flapping/down [21:11:09] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 464.50 seconds [21:20:40] (03PS1) 10Paladox: Disable link to "Reviewers" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487086 [21:35:51] (03Abandoned) 10Paladox: Disable link to "Reviewers" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487086 (owner: 10Paladox) [21:37:41] (03PS1) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 [21:44:41] 10Operations, 10Discovery-Search: Google Search Console access for Search Platform 
team - https://phabricator.wikimedia.org/T188453 (10TJones) [21:51:59] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10jijiki) [21:52:55] !log Depooling thumbor2002 due to disc failure - T214813 [21:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:59] T214813: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 [21:56:07] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10jijiki) p:05Triage→03Normal [21:56:30] 10Operations, 10Cloud-VPS, 10Discovery-Search, 10cloud-services-team: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10TJones) [21:56:40] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Add more metrics to upstream's elasticsearch exporter. - https://phabricator.wikimedia.org/T214547 (10TJones) [21:57:13] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10TJones) [21:58:54] 10Operations, 10Discovery-Search, 10Epic: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10TJones) [21:59:03] 10Operations, 10Discovery-Search, 10Elasticsearch: fix broken visualizations in Elasticsearch Node comparison dashboard - https://phabricator.wikimedia.org/T212831 (10TJones) [21:59:29] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10RobH) a:03Papaul This system just left warranty this month, so any disk swaps will have to be done with on-site spares (so nothing to do remotely, no support cases to file,... 
[21:59:41] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 8.30 seconds [22:00:02] 10Operations, 10Discovery-Search, 10Reading-Infrastructure-Team-Backlog, 10Maps (Tilerator): Log slow queries on postgresql / maps - https://phabricator.wikimedia.org/T204106 (10TJones) [22:00:16] 10Operations, 10Cloud-VPS, 10Discovery-Search, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10TJones) [22:28:29] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Level3 ticket: 15816094 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:29] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Level3 ticket: 15816094 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:30] (03PS1) 10Gehel: Proposal: cleanup of management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/487094 [22:38:59] (03CR) 10Gehel: "Good enough to merge, but a few comments inline. 
Feel free to ignore" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [22:40:04] (03CR) 10jerkins-bot: [V: 04-1] Proposal: cleanup of management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/487094 (owner: 10Gehel) [22:40:42] 10Operations, 10Wikidata, 10Wikidata-Query-Service: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (10Smalyshev) [22:41:35] (03PS2) 10Gehel: Proposal: cleanup of management class [software/spicerack] - 10https://gerrit.wikimedia.org/r/487094 [22:42:40] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10monitoring, and 2 others: upgrade prometheus-blazegraph-exporter to python3 - https://phabricator.wikimedia.org/T213305 (10Smalyshev) 05Open→03Resolved [22:45:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:45:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:59:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:29] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:08:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:16:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:03] 10Operations, 10SRE-Access-Requests: Requesting access to researchers group - https://phabricator.wikimedia.org/T214957 (10phuedx) [23:22:25] 10Operations, 10SRE-Access-Requests: Requesting access to researchers group - https://phabricator.wikimedia.org/T214957 (10phuedx) [23:34:09] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo Ticket number TTN-0003012740 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:46:13] 10Operations, 10Cloud-VPS, 10Discovery-Search, 10SRE-Access-Requests, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10TJones) [23:54:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:54:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[23:56:50] (03PS2) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089