[00:53:19] (PS3) Dzahn: update SPF records from phab1001 to phab1003 IP [dns] - https://gerrit.wikimedia.org/r/505332 (https://phabricator.wikimedia.org/T221389)
[00:53:31] (Abandoned) Dzahn: update SPF records from phab1001 to phab1003 IP [dns] - https://gerrit.wikimedia.org/r/505332 (https://phabricator.wikimedia.org/T221389) (owner: Dzahn)
[00:55:41] (Abandoned) Dzahn: openldap/offboard-user: add wikitech user deactivation [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[01:01:11] (PS1) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/520956
[01:01:13] (PS1) Dzahn: ntp/systemd: add notes_urls for timesyncd and systemd timers [puppet] - https://gerrit.wikimedia.org/r/520957 (https://phabricator.wikimedia.org/T197873)
[01:01:23] (CR) jerkins-bot: [V: -1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/520956 (owner: Dzahn)
[01:01:26] (Abandoned) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/520956 (owner: Dzahn)
[01:11:25] (PS1) Dzahn: analytics: add notes_urls for cluster client and mysql [puppet] - https://gerrit.wikimedia.org/r/520959
[02:28:59] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16782304 and 0 seconds
[02:29:11] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 182070944 and 10 seconds
[02:35:33] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 55649264 and 5 seconds
[02:38:29] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 44936 and 7 seconds
[02:39:13] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 174600 and 53 seconds
[02:39:23] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56920 and 63 seconds
[04:01:19] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:02:55] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:24:41] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:26:13] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:33:03] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (greg)
[04:33:18] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic: Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (greg)
[05:18:12] Operations, Release Pipeline, Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), and 2 others: Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (greg)
[05:22:08] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic: Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (greg) a:mmodell→None
[05:30:37] (PS14) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[05:31:49] (PS15) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[05:33:01] (CR) ArielGlenn: "This has been tested on beta and all dumps are produced properly. I plan to deploy it Sunday before the new run starts, unless there are o" [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[05:41:55] Operations, Dumps-Generation, SDC General, Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (ArielGlenn) I keep pretty decent tabs on wikidata growth, because of the dumps. I don't do that for commons entities because I can't even find the proper...
[06:29:41] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown)
[06:33:07] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:13] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:48:21] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385 (ArielGlenn) Resolved→Open Um, it has? I just found it on meta, though empty. wikiadmin@10.64.48.153(metawiki)> select * from edit_page_t...
[06:53:10] (CR) Smalyshev: refactor wikidata entity dumps into wikibase + wikidata specific bits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[06:56:55] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:21] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:27] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:12:37] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:15:23] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 25540 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:32:57] (PS16) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917)
[07:33:19] (CR) ArielGlenn: refactor wikidata entity dumps into wikibase + wikidata specific bits (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[07:49:11] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[07:52:33] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit
[07:53:55] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.13-13-gd782b2dd6b (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[07:56:35] !log restarting gerrit out of heap space
[07:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:35] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[07:59:37] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:00:31] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26093 bytes in 0.435 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[08:00:43] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.062 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[08:01:03] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational
[08:04:09] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[08:04:43] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[08:05:47] PROBLEM - puppet last run on schema1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[08:07:05] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 8 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page]
[08:15:01] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:31:57] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:33:01] RECOVERY - puppet last run on schema1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:34:17] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[10:31:28] (CR) Elukey: sre.ganeti.makevm: add dns check before creating the vm (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[10:37:54] (PS4) Elukey: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963)
[10:52:37] (PS1) Elukey: base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] - https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698)
[10:55:16] (PS2) Elukey: base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] - https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698)
[10:58:24] (PS3) Elukey: base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] - https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698)
[12:36:24] (PS1) Urbanecm: Add zh_classicalwiki to commonsuploads.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/520992 (https://phabricator.wikimedia.org/T226764)
[12:49:47] Operations, Page-Previews, RESTBase-API: Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (TTO)
[13:00:48] (CR) Hoo man: [C: -1] "Had a quick look only, didn't test. I think this will break the Wikidata JSON dumps." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[13:11:00] (CR) Hoo man: [C: -1] "Given this is bash, this might also work given that empty var == empty string, but still, this should be fixed." [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[13:47:54] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar), User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (elukey)
[13:50:19] Operations, User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (elukey)
[16:32:28] Operations, Dumps-Generation, SDC General, Wikidata: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (Addshore) >>! In T226093#5310687, @ArielGlenn wrote: > I keep pretty decent tabs on wikidata growth, because of the dumps. I don't do that for commons en...
[16:38:55] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:39:05] PROBLEM - clamd running on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (clamav), command name clamd
[16:43:17] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational
[16:43:27] RECOVERY - clamd running on mendelevium is OK: PROCS OK: 1 process with UID = 111 (clamav), command name clamd
[16:47:43] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[17:04:27] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[17:05:45] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 75711 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[17:56:14] (PS1) Urbanecm: Add several Ukrainian government websites to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/520997 (https://phabricator.wikimedia.org/T227366)
[18:15:30] Operations, Machine vision, serviceops, Service-deployment-requests, Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (Multichill) Warning: Minefield ahead. Classifiers are generally quite good when it comes to objectiv...
[19:05:48] (PS1) Fomafix: Map the variant URLs /sr-cyrl/ and /sr-latn/ to /sr-ec/ and /sr-el/ [mediawiki-config] - https://gerrit.wikimedia.org/r/520999 (https://phabricator.wikimedia.org/T117845)
[20:25:01] (PS1) Aaron Schulz: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/521004
[22:28:11] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[22:55:27] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:26:41] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors