[00:09:30] 10Operations, 10User-jbond: CAS SSO: failed u2f registration - https://phabricator.wikimedia.org/T242438 (10crusnov) Here's a separate avenue that may be related to this. So the traceback from Java shows a bad signature, and I dug into this a bit with u2f testing sites. https://u2f.bin.coffee/ works entirely... [00:21:19] (03CR) 10CRusnov: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:22:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10288 and previous config saved to /var/cache/conftool/dbconfig/20200129-002203-marostegui.json [00:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:08] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [00:22:48] (03CR) 10CRusnov: [C: 03+1] "looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/567164 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:26:25] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 680.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:35:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10289 and previous config saved to /var/cache/conftool/dbconfig/20200129-003507-marostegui.json [00:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:11] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [00:35:43] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:40:17] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: 
OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:58:21] 503 errors on phab: Request from [snip] via cp4031 frontend, Varnish XID 355380283 [00:59:17] and on wiki [00:59:26] Request from [snip] via cp4031 frontend, Varnish XID 359830779 [00:59:40] ema vgutierrez ^ [01:00:35] literally everything - also foundationwiki & wikitech, but not beta cluster... can't find a way to check if this was reported already [01:01:00] DannyS712: ping forwarded in RL [01:01:03] ack [01:01:17] RL? [01:01:28] I'm taking a look with ema [01:01:51] can we update the channel status so that reports aren't duplicated? [01:02:29] nevermind ema is talking with vgutierrez [01:03:07] !log depool cp4031 [01:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:24] DannyS712: in person (real life), traffic is on it [01:05:38] !log Disable notifications for dbstore1005:3318 slave lag - T243871 [01:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:42] T243871: Long query running on dbstore1005:3318 - https://phabricator.wikimedia.org/T243871 [01:06:44] DannyS712: working better since that server was depooled? [01:09:45] !log repool cp4031 [01:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:38] it should be fixed now [01:11:00] !log varnish-frontend restarted on cp4031 [01:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:21] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (10Krenair) @jijiki: Hi, deployment-mediawiki-07.deployment-prep.eqiad.wmflabs has puppet disabled since approx `Mon Jan 20 10:56:33 UTC 2020 (12417 minutes ago)` with the comment `effie`... [01:55:53] Sorry, had to step away. Everything working now. Is there a phab task I can follow for updates re cause? 
[02:17:16] DannyS712: https://phabricator.wikimedia.org/T243634 [02:17:24] TBH we need to debug further the issue [02:17:36] we do know that restarting varnish-fe solves it [02:17:53] but we don't know yet why varnish is behaving like that [02:38:49] Hello, IT? Have you tried turning it off and on again? [03:32:40] (03PS1) 10Elukey: Add role to mc-gp100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/568258 (https://phabricator.wikimedia.org/T241795) [03:34:48] (03CR) 10Elukey: [C: 03+2] Add role to mc-gp100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/568258 (https://phabricator.wikimedia.org/T241795) (owner: 10Elukey) [04:34:48] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance Issue, 10Wikimedia-database-error: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (10Krinkle) (re-taggi... [04:51:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.247e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [04:58:10] (03CR) 10Krinkle: [C: 04-1] webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: 10Dzahn) [05:09:57] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [06:32:39] 10Operations, 10serviceops: (Need By: Jan 10) rack/setup/install 
mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10elukey) 05Open→03Resolved [06:32:42] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) [06:33:33] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) [07:58:32] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Strainu) >>! In T243701#5834731, @Addshore wrote: > Should tests be using the production api and site?... [08:11:45] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 516 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:52:25] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 33 probes of 516 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:51:07] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Uh, nice! https://codesearch.wmflabs.org/search/?q=MaxGeneratedPPNodeCount" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: 10C. Scott Ananian) [10:52:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Yup! https://codesearch.wmflabs.org/search/?q=preprocessorClass" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: 10C. 
Scott Ananian) [10:55:13] (03CR) 10Thiemo Kreuz (WMDE): Remove old APCBagOStuff reference (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567162 (owner: 10Aaron Schulz) [10:57:06] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "As of now, it still exists: https://codesearch.wmflabs.org/search/?q=MachineVisionDepictsSetter. Can you please link the patch that remove" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [10:58:40] (03CR) 10Thiemo Kreuz (WMDE): Remove handler deleted from the MachineVision extension on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [11:30:33] 10Operations: Request to block ActionApi client (based on a specific user agent header) - https://phabricator.wikimedia.org/T243858 (10jijiki) p:05Triage→03Normal [12:02:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:03:35] ^ looking [12:04:53] nothing serious for the time being [12:07:50] likely wdqs spike [12:08:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:42:00] (03PS1) 10Arturo Borrero Gonzalez: CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) [12:48:34] (03PS2) 10Arturo Borrero Gonzalez: CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) [12:52:53] (03CR) 10Arturo Borrero Gonzalez: "PCC run: https://puppet-compiler.wmflabs.org/compiler1003/20577/" [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) (owner: 10Arturo Borrero Gonzalez) [13:58:05] (03PS1) 10Arturo Borrero Gonzalez: openstack: puppet-enc: allow hostnames from the new domain [puppet] - 10https://gerrit.wikimedia.org/r/568493 (https://phabricator.wikimedia.org/T243556) [13:59:31] (03PS2) 10Arturo Borrero Gonzalez: openstack: puppet-enc: allow hostnames from the new domain [puppet] - 10https://gerrit.wikimedia.org/r/568493 (https://phabricator.wikimedia.org/T243556) [14:01:45] (03PS3) 10Arturo Borrero Gonzalez: openstack: puppet-enc: allow hostnames from the new domain [puppet] - 10https://gerrit.wikimedia.org/r/568493 (https://phabricator.wikimedia.org/T243556) [14:03:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: puppet-enc: allow hostnames from the new domain [puppet] - 10https://gerrit.wikimedia.org/r/568493 (https://phabricator.wikimedia.org/T243556) (owner: 10Arturo Borrero Gonzalez) [14:05:20] (03PS3) 10Arturo Borrero Gonzalez: CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) [14:07:32] (03PS4) 10Arturo Borrero Gonzalez: CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 
10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) [14:08:36] (03PS5) 10Arturo Borrero Gonzalez: CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) [14:10:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] CloudVPS: set domain for VM instances using nova-api metadata [puppet] - 10https://gerrit.wikimedia.org/r/568473 (https://phabricator.wikimedia.org/T243556) (owner: 10Arturo Borrero Gonzalez) [14:46:30] jynus: wdqs spike causes mediawiki exceptions and fatals? (just read up) [14:48:17] ah, no [14:48:27] I saw just a minor spike on 5XX [14:48:37] aah okay :) [14:48:39] but very minor [14:48:40] just checking! [14:49:47] the mw may have been something else [15:00:01] 10Operations: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (10Volans) Maybe we could converge this into T222826 [15:31:00] I need an Ops to follow WD Oversight operations [1] for Q60616887 [15:31:01] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:31:01] [1]: https://www.wikidata.org/wiki/Wikidata:Oversight#Notes [15:31:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:32:08] revi: I am afraid you need to ping someone from wmde [15:32:16] addshore: ^ [15:32:23] ^ Amir1, Lucas_WMDE, addshore [15:32:30] * addshore looks [15:32:38] I would love to help you, but I am afraid I won't be able to :) [15:32:45] * Amir1 reads [15:32:52] or ideally just make it sync /joke [15:33:08] revi: how can I be of help?
[15:33:19] (feel free to pm) [15:33:20] seems like it needs to be removed from the WDQS DB [15:33:45] This is my first time doing OS on item/property, so :P [15:33:46] it should propagate slowly [15:33:47] I don't believe I can do that? [15:34:13] me neither [15:34:21] Hmm, reading it twice, I should contact #wikimedia-discovery and then -ops here [15:34:32] sounds about right! :) [15:35:04] Amir1: not sure about that, see https://phabricator.wikimedia.org/T105427 [15:35:09] revi: almost everyone is at all hands in -discovery [15:35:15] sh*t [15:35:22] then I probably need ops lol [15:35:27] yeah I forgot about allhands [15:35:47] https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Updating_specific_ID [15:35:49] I am ops, but I do not know much about WDQS :/ [15:35:50] I mean, I can go and make a pointless edit to the entity, which would trigger a wdqs update if desired :) [15:35:53] (03CR) 10Jcrespo: [C: 03+1] "Let me know when some of you are around so I can merge right away while you make sure nothing horrible breaks." [puppet] - 10https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: 10Dave Pifke) [15:36:02] addshore: it is already suppressed [15:36:07] effie: if you have access to the wdqs following https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Updating_specific_ID should be good [15:36:08] which means... you can't make an edit [15:36:13] revi: ah, whole entity, I see :) [15:36:19] ah lready suppressed but ow needs to be flshed out of the wdqs dbs, I see [15:36:23] *now flushed [15:36:39] :thumbs-up: [15:37:01] so this should work ```runUpdate.sh -n wdq -N -- --ids Q60616887``` for whoever wants to run it :) [15:37:11] yeah, but on which server [15:37:17] The runUpdate.sh script is located in the root of WDQS deployment directory. Note that each server needs to be updated separately, they do not share databases.
[15:37:20] all of them [15:37:21] ;) [15:38:10] https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Hardware [15:38:12] looks like wdqs1004,5,6 [15:38:20] hmm, both dcs too [15:38:25] and both internal and external clusters [15:38:27] the problem is, if something for any reason does not go as planned [15:38:27] ideally [15:38:29] oh there's the internal ones [15:38:30] yeah, should be 200* too I believe [15:38:47] it will not be easy to find someone to fix it [15:38:52] if something goes wrong no one here will know how to remedy it, indeed [15:39:16] wdqs[2001-2006].codfw.wmnet,wdqs[1003-1010].eqiad.wmnet [15:39:18] It had label and description removed before the suppression happened [15:39:33] (by someone else) [15:40:06] if we want to be "extra safe" just do it 1 machine at a time, and make sure that the main update process continues / lag doesn't increase [15:40:39] It should be a pretty safe operation to perform though :) [15:41:02] ideally https://phabricator.wikimedia.org/T105427 should be fixed but it's stalled so [15:41:52] revi: +1 [15:41:58] +1 [15:43:23] I will give it a little time until I find at least someone from -discovery [15:43:30] and then I can run it [15:43:58] LGTM [15:48:57] people in sf will be coming on line in the next 1-2 hours, and this particular issue has been pending for a couple days so I think it's ok to wait a bit [15:50:12] I agree [15:50:23] also LGTM :P [16:00:57] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10revi) >>! In T226727#5592292, @Qgil wrote: > > I tried to find information about https://lists.wikimedia.org/ policies and governance in order to know wha... [16:12:19] revi: I'm going to run this update in a couple of minutes [16:12:24] thanks!
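(Editor's note on the WDQS update discussed between 15:37 and 15:40.) A minimal Python sketch, not part of the original log, that expands the host expression quoted at 15:39:16 and pairs each host with the runUpdate.sh command quoted at 15:37:01, so the update can be walked through one machine at a time. The names `expand_hosts` and `plan` are hypothetical, and the sketch only prints a plan; it does not connect to any server or assume where the script lives beyond what the log says.

```python
import re

# Host expression exactly as quoted in the channel at 15:39:16.
HOSTS_EXPR = "wdqs[2001-2006].codfw.wmnet,wdqs[1003-1010].eqiad.wmnet"
# Command exactly as quoted in the channel at 15:37:01, with the entity ID templated.
UPDATE_CMD = "runUpdate.sh -n wdq -N -- --ids {entity}"


def expand_hosts(expr):
    """Expand 'name[start-end].domain' ranges into individual hostnames."""
    hosts = []
    for part in expr.split(","):
        part = part.strip()
        match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", part)
        if match is None:
            # No numeric range in this part: treat it as a plain hostname.
            hosts.append(part)
            continue
        prefix, start, end, suffix = match.groups()
        width = len(start)  # keep zero padding if the range uses it
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


def plan(entity, expr=HOSTS_EXPR):
    """Return (host, command) pairs, one per server, in listed order."""
    command = UPDATE_CMD.format(entity=entity)
    return [(host, command) for host in expand_hosts(expr)]


if __name__ == "__main__":
    for host, command in plan("Q60616887"):
        print(f"{host}: {command}")
```

Running the command itself and checking that update lag does not increase between hosts, as suggested at 15:40:06, is left to the operator.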
[16:12:46] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Zayo TTN-0003851533 - The acknowledgement expires at: 2020-01-31 16:12:02. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:12:46] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo TTN-0003851533 - The acknowledgement expires at: 2020-01-31 16:12:02. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:36:09] (03PS3) 10Jcrespo: Fix log spam from arclamp-generate-svgs [puppet] - 10https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: 10Dave Pifke) [16:39:13] (03CR) 10Jcrespo: [C: 03+2] Fix log spam from arclamp-generate-svgs [puppet] - 10https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: 10Dave Pifke) [17:09:39] (03PS2) 10Krinkle: Remove old APCBagOStuff reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567162 (owner: 10Aaron Schulz) [17:09:44] (03CR) 10Krinkle: [C: 03+2] Remove old APCBagOStuff reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567162 (owner: 10Aaron Schulz) [17:10:47] (03Merged) 10jenkins-bot: Remove old APCBagOStuff reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567162 (owner: 10Aaron Schulz) [17:19:53] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: Ice8dad2 (duration: 01m 10s) [17:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:38] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: introduce basic keys for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/568535 (https://phabricator.wikimedia.org/T243556) [17:28:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: openstack: introduce basic keys for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/568535 
(https://phabricator.wikimedia.org/T243556) (owner: 10Arturo Borrero Gonzalez) [18:06:45] (03CR) 10Volans: "> Patch Set 2: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/567168 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [18:13:22] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Got +1 from Faidon at All-Hands" [homer/public] - 10https://gerrit.wikimedia.org/r/562698 (owner: 10Ayounsi) [18:17:40] !log move knams netflow sampling to cr3-knams [18:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:01] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:18:49] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:19:09] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv4: Connect, AS2914/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:19:35] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:22:48] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:30:11] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Umherirrender) >>! In T243599#5837194, @Dzahn wrote: > from `[mwmaint1002:/var/log/mediawiki/updateSpecialPages/s6@16-AncientP... 
[19:08:11] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:08:21] 10Operations: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (10ayounsi) See also T230835. [19:13:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:39] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:23] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:14:37] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:41:57] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Krinkle) 05Open→03Resolved [19:42:02] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Krinkle) [19:48:39] 10Operations, 10User-jbond: CAS SSO: failed u2f registration - https://phabricator.wikimedia.org/T242438 (10jbond) I have tried to do a bit of research on this and from my reading by default [[ https://developers.yubico.com/WebAuthn/WebAuthn_Developer_Guide/Attestation.html | attestation signatures ]] are not... 
[20:11:56] (03PS1) 10Cmjohnson: adding mgmt dns for es1019-1025 [dns] - 10https://gerrit.wikimedia.org/r/568577 (https://phabricator.wikimedia.org/T241359) [20:15:37] (03CR) 10RobH: [C: 03+1] adding mgmt dns for es1019-1025 [dns] - 10https://gerrit.wikimedia.org/r/568577 (https://phabricator.wikimedia.org/T241359) (owner: 10Cmjohnson) [20:18:37] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:33:57] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:42:47] (03PS7) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [20:48:27] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:26:49] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 24955856 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:28:37] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 31896 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:40:57] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for niedzielski - https://phabricator.wikimedia.org/T243924 (10Niedzielski) [21:55:05] (03Abandoned) 10TheDJ: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/524330 (https://phabricator.wikimedia.org/T228467) (owner: 10TheDJ) [21:57:08] (03PS8) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [22:11:27] (03CR) 
10Cmjohnson: [C: 03+2] adding mgmt dns for es1019-1025 [dns] - 10https://gerrit.wikimedia.org/r/568577 (https://phabricator.wikimedia.org/T241359) (owner: 10Cmjohnson) [22:12:52] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson) [22:13:46] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10elukey) Couple of random thoughts: * we should check the diff between our mcrouter version, 0.37, and the last upstream 0.41, to see if any important bug was presen... [22:14:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson) updated mgmt dns +es1020 1H IN A 10.65.4.144 +es1021 1H IN A 10.65.4.145 +es1022 1H IN A 10... [22:18:07] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) [22:32:17] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 38 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:32:38] (03PS1) 10Brion VIBBER: Add decimal seek offset for videos [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/568646 (https://phabricator.wikimedia.org/T228467) [22:37:13] (03PS1) 10Cmjohnson: Add ganeti1009|101[1-8] dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/568652 (https://phabricator.wikimedia.org/T228924) [22:38:07] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 29 probes of 517 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:49:25] (03CR) 10Brion VIBBER: "I'm unable to run the tests (pyexiv2 doesn't run on macOS, and under Linux it complains about python 3, etc). But it looks ok..." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/568646 (https://phabricator.wikimedia.org/T228467) (owner: 10Brion VIBBER) [22:49:49] (03PS2) 10Cmjohnson: Add ganeti1009|101[1-8] dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/568652 (https://phabricator.wikimedia.org/T228924) [22:54:20] (03CR) 10Cmjohnson: [C: 03+2] Add ganeti1009|101[1-8] dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/568652 (https://phabricator.wikimedia.org/T228924) (owner: 10Cmjohnson) [23:02:17] (03PS9) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [23:25:30] (03PS1) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 [23:25:50] (03CR) 10jerkins-bot: [V: 04-1] gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (owner: 10CRusnov) [23:28:08] (03PS2) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 [23:29:19] (03CR) 10CRusnov: "This change is ready for review." 
[dns] - 10https://gerrit.wikimedia.org/r/568683 (owner: 10CRusnov) [23:33:49] (03PS1) 10Marostegui: db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/568685 (https://phabricator.wikimedia.org/T239453) [23:36:22] (03CR) 10Marostegui: [C: 03+2] db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/568685 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [23:37:45] !log Remove partitions from db2087:3317 - T239453 [23:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:49] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453