[00:03:14] Hauskatze, possibly a flood? There seem to be some news articles about Wikipedia updating the page about the President https://www.google.com/search?q=venezuela+wikipedia&num=20&safe=off&tbm=nws&source=lnt&tbs=qdr:d
[00:03:47] quiddity: probably, but some people said something about DNS
[00:15:54] (PS2) MaxSem: Remove old ArticleCreationWorkflows config [mediawiki-config] - https://gerrit.wikimedia.org/r/462041 (https://phabricator.wikimedia.org/T204016)
[00:16:53] RECOVERY - Memory correctable errors -EDAC- on kafka1014 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1014&var-datasource=eqiad+prometheus/ops
[00:17:12] after deleting /etc/apt/preferences.d/wikimedia.pref everything works :p
[00:17:30] now to the puppet part
[00:17:41] als, "everything" was too early:)
[00:19:17] (Abandoned) Paladox: httpd::mpm: Add missing condition to "if $source {" [puppet] - https://gerrit.wikimedia.org/r/481907 (owner: Paladox)
[00:20:09] paladox: because it was already done, right
[00:20:15] yup
[00:20:19] ack
[01:09:07] Operations, ORES, Scoring-platform-team, Release Pipeline (Blubber), Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (dduvall) >>! In T210267#4873881, @thcipriani wrote: > > What new...
[01:14:11] (PS1) Nuria: Adding default granularities for monthly datasets [puppet] - https://gerrit.wikimedia.org/r/483888 (https://phabricator.wikimedia.org/T209103)
[01:17:21] (PS1) Dzahn: testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports [puppet] - https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366)
[01:19:10] (PS2) Dzahn: testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports [puppet] - https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366)
[01:20:23] (PS3) Dzahn: testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports [puppet] - https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366)
[01:20:51] (CR) Dzahn: [C: +2] testreduce: also pin nodejs,nodejs-dev,npm to stretch-backports [puppet] - https://gerrit.wikimedia.org/r/483889 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[01:24:25] (CR) Dzahn: [C: +2] "follow-up https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483889/" [puppet] - https://gerrit.wikimedia.org/r/482150 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[01:43:50] (PS1) Dzahn: service::node: do not install nodejs-legacy if on stretch [puppet] - https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366)
[01:53:33] (PS2) Dzahn: service::node: do not install nodejs-legacy if on stretch [puppet] - https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366)
[01:58:07] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/14310/" [puppet] - https://gerrit.wikimedia.org/r/483891 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[01:58:48] (PS1) MaxSem: [labs] Remove GuidedTour config [mediawiki-config] - https://gerrit.wikimedia.org/r/483892
[01:58:50] (PS1) MaxSem: [labs] Remove $wgKartographerUsePageLanguage [mediawiki-config] - https://gerrit.wikimedia.org/r/483893
[01:58:52] (PS1) MaxSem: [labs] Remove $wmgVisualEditorUseSingleEditTab [mediawiki-config] - https://gerrit.wikimedia.org/r/483894
[01:58:54] (PS1) MaxSem: [labs] Remove $wmgVisualEditorTransitionDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/483895
[01:58:56] (PS1) MaxSem: [labs] Remove $wmgULSCompactLanguageLinksBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/483896
[01:58:58] (PS1) MaxSem: [labs] Remove $wmgGettingStartedRunTest [mediawiki-config] - https://gerrit.wikimedia.org/r/483897
[01:59:00] (PS1) MaxSem: [labs] Remove $wmgUseQuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/483898
[01:59:02] (PS1) MaxSem: [labs] Remove $wmgUseElectronPdfService [mediawiki-config] - https://gerrit.wikimedia.org/r/483899
[01:59:04] (PS1) MaxSem: [labs] Remove $wmgUseTemplateWizard [mediawiki-config] - https://gerrit.wikimedia.org/r/483900
[01:59:06] (PS1) MaxSem: [labs] Remove $wmgUseLoginNotify [mediawiki-config] - https://gerrit.wikimedia.org/r/483901
[02:00:21] (PS1) Andrew Bogott: wmcs: add a script to update VPS proxies [puppet] - https://gerrit.wikimedia.org/r/483902 (https://phabricator.wikimedia.org/T213540)
[02:01:16] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) After quite some fight we now have nodejs 8 and npm installed via puppet and APT pinning works finally. The next issue is that the service::node class...
[02:01:23] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) p:Normal→High
[02:09:58] (PS1) Legoktm: keys: Add Mukunda's new subkey that was used for the 1.32 release [mediawiki-config] - https://gerrit.wikimedia.org/r/483903 (https://phabricator.wikimedia.org/T213521)
[02:13:26] (CR) Legoktm: [C: +2] keys: Add Mukunda's new subkey that was used for the 1.32 release [mediawiki-config] - https://gerrit.wikimedia.org/r/483903 (https://phabricator.wikimedia.org/T213521) (owner: Legoktm)
[02:14:30] (Merged) jenkins-bot: keys: Add Mukunda's new subkey that was used for the 1.32 release [mediawiki-config] - https://gerrit.wikimedia.org/r/483903 (https://phabricator.wikimedia.org/T213521) (owner: Legoktm)
[02:15:44] (CR) jenkins-bot: keys: Add Mukunda's new subkey that was used for the 1.32 release [mediawiki-config] - https://gerrit.wikimedia.org/r/483903 (https://phabricator.wikimedia.org/T213521) (owner: Legoktm)
[02:16:06] !log legoktm@deploy1001 Synchronized docroot/mediawiki.org/keys: Add Mukunda's new subkey that was used for the 1.32 release - T213521 (duration: 00m 47s)
[02:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:09] T213521: Unknown key used to sign MediaWiki 1.32.0 tarball - https://phabricator.wikimedia.org/T213521
[03:07:47] (PS1) MaxSem: [labs] Remove $wmgUseCodeMirror [mediawiki-config] - https://gerrit.wikimedia.org/r/483904
[03:07:49] (PS1) MaxSem: [labs] Remove $wmgUseAdvancedSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/483905
[03:07:51] (PS1) MaxSem: [labs] Remove $wgAdvancedSearchBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/483906
[03:07:53] (PS1) MaxSem: [labs] Remove $wmgUsePageViewInfo [mediawiki-config] - https://gerrit.wikimedia.org/r/483907
[03:07:55] (PS1) MaxSem: [labs] Remove $wmgUseLinter [mediawiki-config] - https://gerrit.wikimedia.org/r/483908
[03:07:57] (PS1) MaxSem: [labs] Remove $wgPopupsOptInStateForNewAccounts [mediawiki-config] - https://gerrit.wikimedia.org/r/483909
[03:07:59] (PS1) MaxSem: [labs] Remove $wmgUseTemplateStyles [mediawiki-config] - https://gerrit.wikimedia.org/r/483910
[03:08:01] (PS1) MaxSem: [labs] Remove $wgEchoMaxMentionsInEditSummary [mediawiki-config] - https://gerrit.wikimedia.org/r/483911
[03:08:03] (PS1) MaxSem: [labs] Remove $wmgUseNewWikiDiff2Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/483912
[03:14:33] PROBLEM - Check systemd state on ms-be1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:45:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 878.27 seconds
[04:03:19] RECOVERY - Check systemd state on ms-be1017 is OK: OK - running: The system is fully operational
[04:05:44] (CR) Ejegg: [C: +1] "Change looks fine, assuming it's not against policy." [mediawiki-config] - https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: AndyRussG)
[04:22:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 277.43 seconds
[05:56:13] (Abandoned) Smalyshev: Remove rules.log - don't think anything uses it anymore [puppet] - https://gerrit.wikimedia.org/r/483033 (https://phabricator.wikimedia.org/T144539) (owner: Smalyshev)
[05:56:32] (Abandoned) Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: Smalyshev)
[05:57:42] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380 (Smalyshev)
[05:57:45] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: Remove /srv/deployment/wdqs/wdqs/rules.log symlink - https://phabricator.wikimedia.org/T144539 (Smalyshev) Open→Resolved
[09:24:46] PROBLEM - Memory correctable errors -EDAC- on db1068 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[14:30:33] Operations, Puppet: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (hashar)
[16:12:59] !log rebooting mw2167 for a test
[16:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:42] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:16:46] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[18:30:00] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:30:53] hey
[18:31:06] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.008 second response time
[18:31:19] mmm ok
[18:31:28] zotero paged the other day too
[18:31:50] <_joe_> yes
[18:32:06] <_joe_> there is a clear issue with a memory leak in this new version
[18:32:15] are we doing that again?
[18:32:47] <_joe_> apergos: we need to add a readiness probe to zotero, so that kubernetes can detect and kill unresponsive pods
[18:32:58] that would be spiffy
[18:33:00] <_joe_> it's not that easy though
[18:33:16] in time
[18:33:21] <_joe_> it can wait for monday though :)
[18:34:37] oh indeed
[21:27:28] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[21:28:34] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[21:28:40] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[21:31:08] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[21:32:24] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[21:33:10] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[21:33:22] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:33:30] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[21:33:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:34:27] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.009 second response time
[21:34:46] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[21:34:49] <_joe_> uh no icinga bot here
[21:35:44] <_joe_> ...and just lag
[21:36:52] I've scheduled downtime for at least until Monday
[21:36:52] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[21:37:01] it's not really actionable and recovers on its own
[21:37:20] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[21:37:33] also judging from the requests, the ones suffering are mostly monitoring
[21:37:35] <_joe_> uhm still in bad shape it seems
[21:37:51] <_joe_> which means we don't have that many requests probably
[21:38:28] <_joe_> akosiaris: well on monday we need to work on this, but I agree
[21:39:05] the most we can do is open a bug upstream. Maybe also increase the node heap, but I am less optimistic about that
[21:39:15] anyway, going back to bed
[21:39:23] until we won't receive any more pages
[21:40:32] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[21:40:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:41:42] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[21:43:26] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[21:45:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[21:46:38] !log restart all zotero pods in eqiad
[21:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:47:00] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[22:42:14] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[22:44:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:53:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[22:55:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[23:09:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[23:10:14] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy