[00:16:51] 10Operations, 10Traffic, 10Performance Issue, 10Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (10Pigsonthewing) Again, or something different? (Just now, from Birmingham, England)... [00:25:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:29:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:31:27] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:28] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:37:33] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1117.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:48:56] (03PS1) 10Marostegui: db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/567530 [01:50:18] (03CR) 10Marostegui: [C: 03+2] db2085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/567530 (owner: 10Marostegui) [01:52:28] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 
(10Dzahn) a:03Dzahn [01:53:51] 10Operations, 10Gerrit, 10Release-Engineering-Team: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) [01:53:57] 10Operations, 10Gerrit, 10Release-Engineering-Team: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) p:05Triage→03High [01:54:44] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) This user was not created in the linked ticket, it was pre-existing and we are just trying to use it. [01:55:28] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Marostegui) >>! In T243800#5835375, @Dzahn wrote: > This user was not created in the linked ticket, it was pre-existing and we are just trying to use it. can you double check the password... [01:55:41] 10Operations, 10Gerrit, 10Release-Engineering-Team: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Dzahn) This is not the production services. This is a test setup for the 2.16 upgrade. [01:57:22] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) p:05High→03Normal Thanks, I have added a comment to the alert to avoid confusions. [02:00:11] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Dzahn) Tried to avoid alerts on this and turn off monitoring with https://gerrit.wikimedia.org/r/c/operations/puppet/+/562619 but that isn't enough as it still pops up in the web UI for base checks, even if... [02:01:07] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) >>! In T243808#5835382, @Dzahn wrote: > Tried to avoid alerts on this and turn off monitoring with https://gerrit.wikimedia.org/r/c/operations/puppet/+/562619 but that isn't enough as it still p... 
[02:05:33] !log gerrit1002 - gzipping a bunch of /var/log/gerrit/ log files (T243808) [02:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:37] T243808: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 [02:19:17] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Dzahn) @thcipriani ^ This is back to 94% as of right now after ^. And it's been downtime for a month. Is the test instance usable with the current size? Also fwiw, when i looked at /srv and the largest fil... [02:27:56] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) >>! In T243800#5835217, @Paladox wrote: > Couldn't we add the pass here https://github.com/wikimedia/labs-private/tree/master/hieradata/hosts for gerrit1002? That won't work since... [02:38:01] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) Private config including the db_pass is not in (private) Hiera unfortunately. It is in the passwords module and it's not a class parameter either yet. We should move that to Hiera... [02:38:58] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) >>! In T243800#5835376, @Marostegui wrote: > can you double check the password it uses? Turns out the password is hardcoded to the one for the rw user. 
^ [02:39:08] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) p:05Triage→03Normal [02:43:09] (03PS1) 10Dzahn: icinga: add twentyafterfour to gerrit contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567544 [02:48:34] (03PS1) 10Dzahn: icinga: add irc,irc-releng to phabricator contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567550 [02:53:42] 10Operations, 10Design-Research, 10Domains, 10Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (10Dzahn) 05Open→03Stalled [02:57:53] (03PS1) 10Dzahn: add wikipersonas.org and link to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/567558 (https://phabricator.wikimedia.org/T241944) [02:58:32] 10Operations, 10LDAP-Access-Requests: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (10Dzahn) a:03Dzahn [03:01:30] (03PS1) 10Dzahn: admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) [03:29:56] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:32:20] (03PS5) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [03:48:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 54.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:06:14] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:11:25] 10Operations, 10Wikimedia-Mailing-lists: Loss of formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Pine) [04:19:06] 10Operations, 10Wikimedia-Mailing-lists: Loss of formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Pine) I'll clarify one point. I'm aware that the archives show the emails in plain text. The bug is not with the archives. The bug is with how the email was handled between... [08:43:56] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:14:32] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:15:58] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:18:06] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:25:02] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:25:58] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Addshore) Its starting to feel like we should just create a better mechanism for this instead of using... 
[09:38:10] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was receive [09:38:10] age/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:41:48] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:43:06] ^ looking [09:47:12] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:49:20] what did you do? or did it recover on its own? 
[09:50:45] it has been behaving funny since yesterday [09:50:57] I am checking the logs so far [09:51:13] ok [09:52:01] I will do a rolling restart of mobileapps for now [09:52:06] at least on codfw [09:54:44] crossing fingers [09:59:38] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:59:56] !log rolling restart mobileapps in codfw [09:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:03:14] 10Operations, 10Wikimedia-Mailing-lists: Loss of formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10jijiki) p:05Triage→03Normal [10:10:54] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:11:16] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:13:02] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:16:20] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:30:02] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10jijiki) [11:30:34] ACKNOWLEDGEMENT - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% 
Effie Mouzeli Filed a task under T243821 [12:44:24] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Aklapper) [12:48:42] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Aklapper) I assume the default values for the "Content Filtering" section in the preferences at https://lists.wikimedia.org/mailman/admin/wikimedia-l/contentfilter are... [12:52:08] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10LucasWerkmeister) Yeah, at least it’s not a regression – this is the earliest WMYHTW email I have: {F31536630} Though we do also have lists that permit HTML, such as... [12:53:36] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10jijiki) p:05Triage→03High [12:54:02] 10Operations, 10Performance-Team: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10jijiki) p:05Triage→03Normal [12:54:59] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10jijiki) p:05Triage→03Normal [12:56:22] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. 
- https://phabricator.wikimedia.org/T243599 (10jijiki) p:05Triage→03Normal [13:40:13] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) [14:35:14] (03CR) 10Jdlrobson: [C: 03+1] Add wordmark for etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565549 (https://phabricator.wikimedia.org/T230379) (owner: 10Pikne) [14:39:45] (03PS1) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [14:41:50] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [15:06:51] !log Start addshore@mwmaint1002:~$ ./T219123.sh # Taking over from @ladsgroup for T219123 [15:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:56] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:23:29] (03PS2) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [15:25:40] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [15:33:00] (03CR) 10Joal: "Thanks for the offer @ArielGlenn :) We are waiting for another related code change to be merged and deplpoyed before merging that one.I'll" [puppet] - 10https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: 10Joal) [15:35:54] (03PS3) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 
(https://phabricator.wikimedia.org/T209655) [15:38:02] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [15:46:33] are there docs on wikimedia-operations? [15:46:50] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Arlolra) [15:46:52] it seems like a big set of infrastructure [15:47:00] would be useful to learn how it is managed, incident response, etc [15:47:08] so I can apply it to my own job [15:47:22] and who better to share info than a wiki [15:49:27] Hoffman: what sort of thing do you mean? there are definitely docs on the wikimedia production cluster [15:50:10] the huge 'probably don't want to start here' portal is at https://wikitech.wikimedia.org/wiki/Portal:Wikitech [15:51:22] * Hoffman looks there and sees if it's a good place to start [15:51:25] apergos: thanks [15:52:08] apergos: I was thinking in terms of how work is handled and managed [15:52:10] and distributed [15:52:20] sure! 
there are presentations that give a general overview of the infrastructure too, some more outdated than others: https://www.mediawiki.org/wiki/Presentations [15:52:34] much of what we do is handled through phabricator [15:52:55] and we have recently transitioned to an OKR system, though I don't know how much that really impacts our workflow [15:53:07] https://phabricator.wikimedia.org/ [15:53:26] thanks again [15:53:35] we generally each have a few outstanding tasks we are working on, and rather a lot more we are watching or assisting with [15:53:37] sure thing [15:56:08] (03PS4) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [16:18:56] 10Operations, 10ops-codfw, 10User-jbond: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10jbond) [16:19:22] 10Operations, 10ops-codfw, 10User-jbond: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10jbond) 05Open→03Resolved This server is now active in production Thanks [16:26:10] 10Operations, 10Performance-Team: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Krinkle) I've ruled out any structural failure (e.g. the script looking in the wrong directory) because it does still reliably do what it's... 
[16:26:34] 10Operations, 10Arc-Lamp, 10Performance-Team: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Krinkle) [16:26:45] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Krinkle) [16:28:16] Hoffman: This might be of interest - https://wikitech.wikimedia.org/wiki/MediaWiki#Infrastructure [16:40:06] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [16:47:24] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) All three have the OS installed. Still need the initial puppet run [16:47:35] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [16:50:22] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10jcrespo) Assuming no impact on actual functionality, this could have lower priority, not have... [16:52:53] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Krinkle) I consider log spam as impact. It's okay, we'll get it fixed. We only have one other... 
[17:01:17] 10Operations, 10Traffic, 10Performance Issue, 10Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (10Addshore) >>! In T243713#5835243, @Krinkle wrote: >>>! In T243713#5833717, @Michael w... [17:11:23] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: fix address of ns0.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/567986 (https://phabricator.wikimedia.org/T243766) [17:13:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: fix address of ns0.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/567986 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [17:17:10] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10jcrespo) Thanks, let us know how we can help. [18:23:47] (03CR) 10Dzahn: "The reason jenkins-bot downvotes you is because a profile class includes a non-profile (module) class. You can either move everything to " [puppet] - 10https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [18:29:26] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Dzahn) Looking at what is disabled in general, i see this is not specific to Japanese Wikipedia but globally these are disable... [18:31:33] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Jdforrester-WMF) Yeah, looks like this didn't run in the four times expected since (two in December, two in January). 
[18:42:02] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Dzahn) Disabling is not at the puppet / cronjob level. These are "sub-jobs" of the single "update_special_pages" job. The scr... [18:44:37] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Dzahn) git blame for these lines brings us to: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/541358 [18:45:14] (03CR) 10Dzahn: "please see https://phabricator.wikimedia.org/T243599" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/541358 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [18:46:34] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Jdforrester-WMF) >>! In T243599#5836900, @Dzahn wrote: > git blame for these lines brings us to: > > https://gerrit.wikimedia... [18:47:58] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Dzahn) @Umherirrender Looking at the change above and this ticket, do you know more about why and when these jobs were disabled? [18:48:56] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Jdforrester-WMF) >>! In T243599#5836920, @Dzahn wrote: > @Umherirrender Looking at the change above and this ticket, do you kn... [18:54:41] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. 
- https://phabricator.wikimedia.org/T243599 (10Dzahn) >>! In T243599#5836911, @Jdforrester-WMF wrote: > Yeah, does whatever in puppet is parsing config not understand the ne... [19:17:47] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks! It's back, I'd guess a transient error on Equinix's network. Not worth investigating it more as it's now up and only OOB. [19:27:34] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [19:41:38] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:11] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10jcrespo) 05Resolved→03Open ` [19:41] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ` [20:01:23] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10ayounsi) `name=not working ayounsi@icinga1001:~$ mtr -z --report-wide 2607:f6f0:205::153 Start: Tue Jan 28 19:51:46 2020 HOST: icinga1001 Loss% Snt Last Avg Best Wrst StDev... [20:02:09] ACKNOWLEDGEMENT - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi https://phabricator.wikimedia.org/T243821 - The acknowledgement expires at: 2020-01-30 20:01:47. [20:02:14] 10Operations: Request to block ActionApi-Client - https://phabricator.wikimedia.org/T243858 (10fwunderl) [20:30:18] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Umherirrender) In puppet is `mediawiki::maintenance::updatequerypages::ancientpages` which is running the update with the over... 
[20:56:15] (03PS6) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [21:00:24] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.22 ms [21:06:32] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 387.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:13:44] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [21:49:43] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [21:51:17] 10Operations, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) a:05Cmjohnson→03elukey @elukey these are ready for you to implement. I am removing ops-eqiad tag. [21:51:47] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/567175 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [21:52:52] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki. - https://phabricator.wikimedia.org/T243599 (10Dzahn) from `[mwmaint1002:/var/log/mediawiki/updateSpecialPages/s6@16-AncientPages.log] $ ` ` 1 --------------------------... [21:54:25] (03CR) 10CRusnov: [C: 03+1] "since cluster is being renamed elsepatch (not directly related to, nor connected to this) is cluster the proper name for this property? 
Ot" [software/spicerack] - 10https://gerrit.wikimedia.org/r/567168 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [21:59:46] 10Operations: Request to block ActionApi client (based on a specific user agent header) - https://phabricator.wikimedia.org/T243858 (10Aklapper) [21:59:55] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) [22:01:18] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) a:05Cmjohnson→03akosiaris @akosiaris I am assigning this to you, the initial puppet run has been completed. I removed the ops-eqiad tag [22:17:55] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Quiddity) HTML (rich text) formatting is purposefully disabled on almost all our mailing lists, just as are email attachments, partially for security reasons. This is... [22:29:08] (03PS5) 10Bmansurov: Add recommendation-api chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) [22:29:50] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [22:29:56] (03PS6) 10Bmansurov: Add recommendation-api chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) [22:30:37] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [22:31:37] 10Operations, 10Wikimedia-Mailing-lists: Loss of HTML formatting in email to or from Wikimedia-l - https://phabricator.wikimedia.org/T243809 (10Aklapper) 05Open→03Invalid I'm closing this task as invalid as the current behavior is intentional. 
In general, admins of a specific mailing list can change settin... [22:32:26] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1021 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [22:41:36] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1022 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [22:46:48] (03PS1) 10Dave Pifke: Scrape perception survey alerts from Grafana [puppet] - 10https://gerrit.wikimedia.org/r/568092 (https://phabricator.wikimedia.org/T243865) [22:50:02] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1019 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [22:53:05] (03CR) 10Bmansurov: "Alexandros Kosiaris, thanks for the reviews. After adding the config.app, I'm getting an error from Jenkins. 
Do I need to make configmap.y" [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [22:53:58] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1020 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [22:56:03] (03CR) 10Gilles: [C: 03+1] Scrape perception survey alerts from Grafana [puppet] - 10https://gerrit.wikimedia.org/r/568092 (https://phabricator.wikimedia.org/T243865) (owner: 10Dave Pifke) [22:57:02] (03CR) 10Gilles: [C: 03+1] Scrape perception survey alerts from Grafana (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/568092 (https://phabricator.wikimedia.org/T243865) (owner: 10Dave Pifke) [23:26:59] 10Operations, 10Arc-Lamp, 10Performance-Team, 10Wikimedia-production-error: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (10Gilles) a:03dpifke [23:32:30] (03PS1) 10Dave Pifke: Fix log spam from arclamp-generate-svgs [puppet] - 10https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) [23:35:33] 10Operations: mr1-eqiad.oob IPv6 is down - https://phabricator.wikimedia.org/T243821 (10ayounsi) 05Open→03Resolved Router bug, confirmed fixed by NTT. 
[23:36:05] (CR) Krinkle: Fix log spam from arclamp-generate-svgs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: Dave Pifke)
[23:36:43] (CR) Krinkle: [C: +1] Fix log spam from arclamp-generate-svgs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: Dave Pifke)
[23:37:38] (PS1) Marostegui: db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/568121 (https://phabricator.wikimedia.org/T232446)
[23:38:56] (CR) Marostegui: [C: +2] db1121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/568121 (https://phabricator.wikimedia.org/T232446) (owner: Marostegui)
[23:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121 T232446', diff saved to https://phabricator.wikimedia.org/P10284 and previous config saved to /var/cache/conftool/dbconfig/20200128-234037-marostegui.json
[23:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:43] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[23:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Start repooling db1084 with its original weight', diff saved to https://phabricator.wikimedia.org/P10285 and previous config saved to /var/cache/conftool/dbconfig/20200128-234219-marostegui.json
[23:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:47] (PS2) Dave Pifke: Fix log spam from arclamp-generate-svgs [puppet] - https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762)
[23:46:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10286 and previous config saved to /var/cache/conftool/dbconfig/20200128-234601-marostegui.json
[23:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:11] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[23:47:14] (CR) Krinkle: [C: +1] Fix log spam from arclamp-generate-svgs [puppet] - https://gerrit.wikimedia.org/r/568117 (https://phabricator.wikimedia.org/T243762) (owner: Dave Pifke)
[23:49:56] (CR) Krinkle: Scrape perception survey alerts from Grafana (1 comment) [puppet] - https://gerrit.wikimedia.org/r/568092 (https://phabricator.wikimedia.org/T243865) (owner: Dave Pifke)
[23:51:04] PROBLEM - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 361.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:52:38] (CR) Gilles: [C: +1] Scrape perception survey alerts from Grafana (1 comment) [puppet] - https://gerrit.wikimedia.org/r/568092 (https://phabricator.wikimedia.org/T243865) (owner: Dave Pifke)
[23:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10287 and previous config saved to /var/cache/conftool/dbconfig/20200128-235336-marostegui.json
[23:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:40] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[23:56:30] RECOVERY - MariaDB Slave Lag: s8 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave